On the Detectability of LLM-Generated Text:
What Exactly Is LLM-Generated Text?

Mingmeng Geng Thierry Poibeau
École Normale Supérieure (ENS) - Université Paris Sciences et Lettres (PSL)
Laboratoire Lattice (CNRS, ENS-PSL, Université Sorbonne Nouvelle)
mingmeng.geng@ens.psl.eu


Abstract

With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely “LLM-generated text”. Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detection target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the varied conditions under which detectors are applied in the real world. As a result, the numerical results of detectors are often misunderstood, and their significance is diminishing. Detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.

1 Introduction

The rapid development of large language models (LLMs) has led to a rise in LLM-generated text across various fields, such as academia [1], [2] and Wikipedia [3], [4]. The use of AI may also attract negative judgments (e.g., of low-quality work or a lack of creativity), which has become a social phenomenon [5]. Thus, the detection of LLM-generated text is not merely a technical matter, but also a social issue.

Many detectors have been proposed and studied [6], [7], while various issues and challenges have also been examined [7]–[10]. The reliability of detectors has also been widely discussed, for example, their robustness against attacks and their behavior in different situations [11]–[13]. Like [13], we explore the following question:

Is it possible to detect LLM-generated text in practice?

Researchers often say that text generated by LLMs differs from text written by humans [14]–[16]. The indistinguishability of LLM-generated and human-written text is one of the major challenges posed by LLMs [17]. However, a basic problem has not been adequately addressed in the current literature:

What exactly is “LLM-generated text”?

A simple and direct example: the output of an LLM varies with the prompt it is given. Just as different people write in different styles [18], different LLMs generate text in different styles [19], [20].

The broad and vague definitions of LLM-generated text have resulted in inconsistent benchmarks and evaluations, which, in turn, have created new ethical issues and challenges in practical use. If we take into account the diversity of LLMs and human-in-the-loop workflows, the situation becomes even more intricate, and few detectors have tried to distinguish such cases [21]. It has been shown that humans possess the capacity to identify LLM-generated text with reasonable accuracy [22], and coevolution between humans and LLMs may already have occurred [23]. Empirical studies have consistently demonstrated that different LLMs exhibit distinct stylistic fingerprints, which can even be used for classification [20], [24], [25]. The continuous evolution and adjustment of watermarking and attacks are also tied to this definition, adding further complexity to the issue of detectability. LLMs may in the future be fully capable of generating text that is nearly indistinguishable from human-written content.

Hence, in many practical cases, we argue that it is not possible to accurately detect LLM-generated text. We discuss the detectability of LLM-generated text in detail in the following sections.

2 Definition and Usage

As mentioned earlier, what exactly is the detection target, “LLM-generated text”? Researchers have referred to it using expressions such as “machine-generated text” or “AI-generated text”, as listed in Table 1. For simplicity, we use “LLM-generated text” for the subject of study in this paper, as it is more precise than the other expressions.

Table 1: Definition of LLM-generated text in different papers
Paper Definition
[26] “Machine-generated text” is natural language text that is produced, modified, or extended by a machine.
[27] In this survey, we define AI-generated text as output produced by a natural language generation pipeline employing a neural probabilistic language model.
[7] LLM-generated Text is defined as cohesive, grammatically sound, and pertinent content generated by LLMs.

Apart from slight differences in terminology, the definition of “LLM-generated text” in most papers is quite broad: the text can be produced in many ways using LLMs, such as paraphrasing, translation, or generating long text from simple prompts. Theoretically, the difference between “LLM-generated text” and “human-written text” lies in how they are produced, as the names imply. In practice, however, we can only evaluate them based on the final output, i.e., the text, where a considerable overlap exists between the two.

We think that what is commonly regarded as the detection target usually represents only a subset of the text that LLMs can potentially produce, most likely the kind that corresponds to the most common and direct prompts. For instance, many detectors are trained on text generated by LLMs, which cannot represent all possibilities. Consequently, their detection capabilities are constrained. While different parameters can be set for various types of cases [28], such configurations can hardly cover all possible scenarios.

In addition to the examples given earlier, traces of LLM-generated text have now been found in various fields, such as student essay answers [29], words used in speech [30]–[32], and numerous online texts [33]–[35]. We can even say that LLM-generated text has permeated all areas of life, mixing with human writing.

While the limitations of these detection methods have caused concern among researchers [11], [36]–[39], the tools are nevertheless applied in diverse contexts. Such detection tools are often promoted for their potential utility in identifying instances of plagiarism [40], academic dishonesty (e.g., cheating during examinations) [41], the automatic generation of unethical peer reviews [42], and other forms of content manipulation.

Therefore, evaluating and comparing the effectiveness of different detectors presents a significant challenge.

3 Background

Most people probably became aware of LLMs only after the release of ChatGPT, but research on detecting text generated by language models had started before that. For example, [43] proposed the GLTR (Giant Language model Test Room) tool to detect whether text was generated by models, with experiments involving GPT-2 [44] and BERT [45]. [46] developed the Grover model to detect AI-generated fake news. Even GPT-3 [47] still faced skepticism about its text-generation capabilities [48], so the good performance of detectors at the time was unsurprising. Another pioneering work, [49], demonstrated that humans already had some difficulty identifying texts generated by GPT-2. Later, [50] found that untrained people at the time were not very good at recognizing text produced by GPT-3, and [51] observed a similar situation for machine-paraphrased plagiarism.

The rapid development and spread of LLMs have drawn significant attention from researchers to the detection of LLM-generated text, and diverse methods have been proposed [7]: DetectGPT [52], Fast-DetectGPT [53], DetectLLM [54], LLMDet [55], DeID-GPT [56], and others in 2023 [57], [58]; Binoculars [28], TOCSIN [59], Dpic [60], and Text Fluoroscopy [61] in 2024; and further detection methods in 2025 [62], [63]. The examples listed above are illustrative, and the actual number of detectors is much larger.
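Many of the zero-shot methods above share a common backbone: they score a text with token-level statistics computed by an open scoring model and then threshold that score. The following is a minimal sketch of such a statistic (mean token log-likelihood under GPT-2); it illustrates the general idea only and is not a reimplementation of DetectGPT or its successors, which use more elaborate perturbation- and sampling-based curvature statistics.

```python
# Minimal sketch of a zero-shot detection statistic: the mean token
# log-likelihood of a text under a small scoring model (gpt2 here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_log_likelihood(text: str) -> float:
    """Average log-probability per token of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability assigned to each observed token given its left context.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

# Higher (less negative) scores loosely suggest more "model-like" text, but
# any decision threshold has to be calibrated on data, which is exactly
# where scenario and benchmark dependence enters.
```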

These detection methods can be classified into many categories according to different criteria. For instance, [10] classifies them as supervised methods, zero-shot methods, retrieval-based methods, watermarking methods, and discriminating features. [7] mainly examines them through the lens of watermarking techniques, statistics-based detectors, neural-based detectors, and human-assisted methods. Specialized detectors have been developed, for instance, targeting tweets [64], [65], journalistic news articles [66], and Wikipedia [67]. Additionally, the detection of LLM-generated text is not limited to English [68]. Detectors for other languages have also been proposed, such as French [69], Japanese [70], and Chinese [71].

There are other ways to categorize the detectors. For example, most studies consider only binary classification, and multi-category settings have not been comprehensively explored. Despite the wide range of methods these detectors employ, the absence of universal benchmarks and the diversity of application scenarios limit meaningful comparison [72]. Although these detection techniques performed well earlier on certain datasets, the ongoing progress of LLMs also makes detection harder [7].

Researchers have not yet reached full agreement on the detectability of LLM-generated text. For instance, [13] claim in their position paper: “Despite ongoing debate about the feasibility of such differentiation, we present evidence supporting its consistent achievability, except when human and machine text distributions are indistinguishable across their entire support. Drawing from information theory, we argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases.” In contrast, [73] argue that “We believe that the issue of AI-generated text detection remains an unresolved challenge. As LLMs become increasingly powerful and humans become more proficient in using them, it is even less likely to detect AI text in the future.” [38] “advise against continued reliance on LLM-generated text detectors”, and [74] try to show that “perfect AI detection is mathematically impossible”.

The key disagreement among researchers may not be technical in nature, but rather stems from differing perspectives on human intervention and usage scenarios; different researchers make different assumptions on this issue. Take watermarking studies as an example: if people edit the generated text (which is simple to do), the watermark’s reliability may be greatly weakened [75].

4 Evaluation and Benchmark

We now focus on a relatively simple scenario: “LLM-generated text” refers to the content produced by LLMs in response to ordinary prompts.

As outlined earlier, there have always been many doubts about the effectiveness of these detectors [11], [76]. The sensitivity of detectors and the problem of false positives have been widely criticized and examined [76]–[79]. By way of illustration, a detection accuracy of 85% necessarily implies a 15% error rate, which may include both false positives and false negatives. [80] found that an AI text detector erroneously identified up to 8% of known real abstracts as AI-generated text, and [81] notes that false positives disproportionately affect non-native English speakers and scholars with distinctive writing styles.
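How much harm a given false-positive rate causes also depends on the base rate of LLM-generated text in the screened population, a dependence that headline accuracy figures hide. A minimal sketch with hypothetical numbers:

```python
# Hypothetical base-rate arithmetic: even a detector with 95% sensitivity
# and 95% specificity produces mostly false accusations when LLM-generated
# text is rare among the documents being screened.
def positive_predictive_value(sensitivity: float, specificity: float,
                              base_rate: float) -> float:
    """P(text is LLM-generated | detector flags it), by Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1.0 - specificity) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

for base_rate in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(0.95, 0.95, base_rate)
    print(f"base rate {base_rate:4.2f} -> P(LLM | flagged) = {ppv:.2f}")
# base rate 0.50 -> 0.95; 0.10 -> 0.68; 0.01 -> 0.16
```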

Studies indicate that the detectability of a text depends on the LLM used to generate it [82]. For instance, [83] point out that Binoculars [28] achieved only a 55.15% AUROC in detecting texts generated by Claude, while for texts generated by several other models it reached at least 88%. Given the diversity of usage scenarios and the variety of LLMs, such results are to be expected, and no existing LLM-detection system is infallible.
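To read such numbers correctly: an AUROC of 55.15% means the detector ranks a randomly chosen LLM-generated text above a randomly chosen human-written one barely more often than a coin flip would. The sketch below, using synthetic score distributions (purely illustrative, not real detector outputs), makes the gap between roughly 55% and 88% concrete:

```python
# Synthetic illustration of what different AUROC values mean. Scores are
# drawn from two unit-variance normals; the mean gap controls separability.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
human = rng.normal(0.0, 1.0, n)        # detector scores on human text
llm_hard = rng.normal(0.18, 1.0, n)    # heavy overlap  -> AUROC ~ 0.55
llm_easy = rng.normal(1.66, 1.0, n)    # clear separation -> AUROC ~ 0.88

labels = np.concatenate([np.zeros(n), np.ones(n)])
print(roc_auc_score(labels, np.concatenate([human, llm_hard])))  # ~0.55
print(roc_auc_score(labels, np.concatenate([human, llm_easy])))  # ~0.88
```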

Besides, most studies focus on the binary classification problem of determining whether a given text is generated by LLMs. While some detection methods achieve good results on given datasets, the scenario becomes much more complicated in real-world settings. For example, people can edit LLM-generated text or mix it with human-written text, which has attracted considerable attention [42], [84]–[86]. Human contribution also affects detection results [87]. Only a small number of researchers have tried to identify the specific roles of LLMs in content generation [21], and no universally accepted approaches have been established.

Another challenge in establishing benchmarks for detecting LLM-generated text is that LLMs are continuously evolving, and their characteristics do not remain the same. For example, [88] created their benchmark with GPT-2, whose output should differ considerably from that of current advanced LLMs. Although LLMs were compared to stochastic parrots [48] a couple of years ago, their capabilities are gradually being recognized [89], and their competencies have now reached or even surpassed those of human experts in various fields. This leads us to speculate that future LLM-generated texts could surpass current versions in human resemblance; the persona effect could be a good example [90]–[92].

Some researchers have shown that detectors may more easily flag text from older and smaller models [39], [93]. The performance of detection systems varies with the application scenario; for instance, they tend to be less effective on short texts [94], [95]. More LLMs were employed in subsequent benchmark construction [96]–[98], but the number of prompts and scenarios used was limited. Some recent benchmarks [83], [99] have incorporated a broader range of scenarios, and their impact and effectiveness remain to be seen. There is a similar issue with datasets. For example, [100] found that all the datasets they analyzed fail one or another of their quality criteria and thus do not allow reliable evaluation of AI detectors.

Therefore, the adoption of detection benchmarks has not yet become as widespread as that of other well-known LLM benchmarks, such as GLUE [101] and MMLU [102]. Although those benchmarks have also faced some criticism [103], there is still no truly universal benchmark for detecting LLM-generated texts. Researchers have also started developing shared benchmarks to tackle this problem [72]. Nevertheless, due to the diversity of usage scenarios and the continuous updates of LLMs, a gold-standard benchmark is hard to realize and may even remain permanently out of reach.

5 Attack, Watermarking, and Evolution

The brittleness of these tools is exposed by many realistic scenarios, including various attacks. An early study showed that while humans can reliably detect poetry produced by GPT-2, they struggle to recognize it after human selection [104]. Once post-generation modifications are taken into account, detection becomes even more challenging. For example, although some recent systems claim to maintain performance when LLM-generated text is lightly modified, empirical evidence shows that detection accuracy tends to decline as the extent of human revision increases [105].

Studies on how to attack these detectors are also being conducted from different angles, such as paraphrasing [11], [106], adversarial prompting [79], [107], and other forms of authorship obfuscation [108]. [79] argue that adversarial attacks can easily bypass these detectors and that achieving high sensitivity at a reasonable false positive rate remains challenging. [109] found that even minor adjustments to decoding parameters (such as temperature and top-p) can severely impair detector accuracy. Text produced by a fine-tuned model can also be more challenging to detect [110]. Even subtle differences can affect detectors’ results, such as a single space [111] or minor perturbations [112].
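As an illustration of how cheaply a text producer can shift the output distribution, the sketch below regenerates from one prompt under several decoding settings; the model and parameter values are illustrative stand-ins, not the configurations studied in [109]:

```python
# Sampling the same prompt under different decoding settings. Per [109],
# even such minor adjustments can severely impair detector accuracy.
# gpt2 and the (temperature, top_p) pairs here are illustrative stand-ins.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("I propose to consider the question,", return_tensors="pt")

for temperature, top_p in [(0.7, 1.0), (1.0, 0.9), (1.3, 0.95)]:
    out = model.generate(**inputs, do_sample=True, temperature=temperature,
                         top_p=top_p, max_new_tokens=40,
                         pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(out[0], skip_special_tokens=True)
    print(f"T={temperature}, top_p={top_p}: {text!r}")
```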

To address the detection problem at its source, researchers have begun experimenting with watermarking methods, which have shown good effectiveness in simulation [113] and may also be an ethical necessity [114]. Diverse watermarking techniques have been proposed, both in theory [115] and in practice [75]. In addition, some researchers have pointed out the risk of watermark stealing [116], and studies on adversarial attacks against watermarking are ongoing [117].
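To build intuition for why light editing degrades watermarks, consider the detection side of a “green list” scheme, a common design in this line of work: generation softly favors a pseudorandom, context-keyed subset of the vocabulary, and detection tests whether observed tokens over-represent that subset. The sketch below is a deliberate simplification under that assumption (real schemes use secret keys and more careful seeding):

```python
# Simplified "green list" watermark detection: count how many tokens fall
# in a pseudorandom subset seeded by the preceding token, then z-test the
# count against the no-watermark expectation. Illustrative, not a real scheme.
import hashlib

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" per context

def is_green(prev_token: int, token: int) -> bool:
    # Real schemes derive this from a secret key; a plain hash stands in here.
    digest = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def watermark_z_score(tokens: list[int]) -> float:
    """Z-score of the green-token count; near 0 for unwatermarked text."""
    n = len(tokens) - 1
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    expected = n * GREEN_FRACTION
    variance = n * GREEN_FRACTION * (1.0 - GREEN_FRACTION)
    return (hits - expected) / variance ** 0.5

# Watermarked generation that favored green tokens yields a large positive
# z-score; each human edit replaces green tokens at random and pulls the
# score back toward 0, which is why edited text tends to evade detection.
```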

While there are specific solutions to certain attacks [118], [119], in practice the space of possible attacks is vast. Consequently, the tools’ applicability to real-world use cases remains limited. This limitation exacerbates the concerns raised earlier, as reliance on imperfect systems in nuanced or ambiguous situations increases the likelihood of erroneous judgments.

Some of these difficulties are simply unavoidable, not merely temporary technical challenges. As several researchers have noted, humans can learn to detect AI-generated texts [120] and can become accurate and robust detectors of LLM-generated text [22]. As LLMs become more widely used and people learn from their outputs, the difference between the two may shrink further. We can find more cases of coevolution [23], [121] between humans and machines. The gap between LLM-generated and human-written text is therefore expected to narrow further, though we think the two will not become entirely identical.

Consequently, detecting LLM-generated text may become less important in the future, particularly where the interpretation of numerical scores is concerned. While a hybrid human-in-the-loop model is a possible option for detectors, it should be regarded as one way to mitigate the problem, not a perfect solution.

6 Ethical Considerations

The social impact of LLMs has already been considered [122]. People began discussing their positive impact shortly after their emergence [123]. LLMs can play a valuable role in numerous contexts, such as enhancing productivity [124]. They can help bridge linguistic divides by enabling non-native speakers to produce coherent and idiomatic texts in English or other target languages, thereby supporting greater inclusivity in academic and professional communication. Non-native English speakers have to put in more effort as scientists, and there has been discrimination against them in the past [125], [126]. Automatic editing methods have shown promise in improving alignment between LLM-generated and human-written text [127]. LLMs may also have the potential to serve as effective tutors in English education [128].

However, text generated with LLMs is frequently the subject of criticism [5], particularly due to concerns about academic dishonesty, plagiarism, and fraud, which has led to the development of various detection tools [7], [51], [129]. As we have discussed, the definition of LLM-generated text is broad and lacks consistency. In some cases, the LLM acts as a language aid, and the substantive intellectual work remains the responsibility of human authors. It is therefore natural to ask the following question:

Should we use these detectors?

In addition to the technical challenges discussed above, the use of detectors requires consideration of possible ethical risks. A non-native speaker might use an LLM to translate, rephrase, or refine a manuscript that is otherwise the product of original research. Yet experiments show that certain detectors exhibit bias against non-native English writers [36] or against certain demographic groups [130]. The analyses in [91] reveal that all the detectors they tested are highly sensitive to CEFR level and language environment. Signs of language discrimination and resistance have also been noted recently [126].

As such, current detectors are far from perfect and may never achieve perfection. Accusing someone of misconduct on the basis of an imperfect tool can lead to unjust outcomes, reputational damage, and institutional distrust. Therefore, even detectors with relatively high accuracy present significant risks when used for evaluative or disciplinary purposes. With LLMs so widely used in academia [22], [131], the detection of AI-generated text must be handled with extreme care.

In addition to the detection and policing of LLM-generated text, it may also be useful to advocate for transparency regarding their use [132] and promote AI literacy [133]. In academic publishing, for example, it is increasingly common to disclose how LLMs have assisted in drafting, editing, or rephrasing portions of a manuscript. Clear guidelines and disclosures can thus help normalize the ethical integration of LLMs into scholarly workflows without undermining academic integrity.

If LLMs are applied properly and detection tools are assessed reasonably, the positive influence of LLMs can be greatly amplified. Those familiar with LLMs and detectors are aware of the potential issues, but the public tends to be drawn to headline numbers and brief conclusions. The mechanism behind the language style of LLMs is currently unclear [134]. The lack of detector interpretability is another concern [135], severely limiting the ability to provide transparent explanations to the public.

7 Case Study

We have previously reviewed numerous problems with the detection of LLM-generated text, and new studies keep confirming these issues [136]–[138]. We therefore illustrate them with simple examples using different LLMs (DeepSeek-V3.2 [139], DeepSeek-R1 [140], GPT-3.5, GPT-4o-mini, GPT-4o [141]). The prompts employed are as follows (a collection script is sketched after the list):

  • (P1) Polish the following passage (provide only the result): …

  • (P2) Rewrite the following passage (provide only the result): …

  • (P3) Please polish the following passage (provide only the result): …

  • (P4) Please rewrite the following passage (provide only the result): …
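Outputs of this kind can be collected with a short script along the following lines. This is a sketch only: the openai client usage and the model identifier are assumptions reflecting common practice, not necessarily our exact setup; the temperature is 0, as stated in Appendix 9.

```python
# Minimal sketch for collecting the rewrites used in this case study.
# The client usage and model name are assumptions, not the exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ORIGINAL = "..."  # the full first paragraph of Turing (1950), shown below
PROMPTS = {
    "P1": "Polish the following passage (provide only the result): ",
    "P2": "Rewrite the following passage (provide only the result): ",
    "P3": "Please polish the following passage (provide only the result): ",
    "P4": "Please rewrite the following passage (provide only the result): ",
}

for name, prefix in PROMPTS.items():
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # as in Appendix 9
        messages=[{"role": "user", "content": prefix + ORIGINAL}],
    )
    print(name, response.choices[0].message.content)
```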

The original text (the first paragraph of Alan Turing’s famous paper Computing Machinery and Intelligence [142]) and a processed example are shown below. More results are provided in Appendix 9.

7.0.0.1 Original Text

I propose to consider the question, "Can machines think?" This should begin with definitions of the meaning of the terms "machine" and "think." The definitions might be framed so as to reflect so far as possible the normal use of the words, but this attitude is dangerous, If the meaning of the words "machine" and "think" are to be found by examining how they are commonly used it is difficult to escape the conclusion that the meaning and the answer to the question, "Can machines think?" is to be sought in a statistical survey such as a Gallup poll. But this is absurd. Instead of attempting such a definition I shall replace the question by another, which is closely related to it and is expressed in relatively unambiguous words.

7.0.0.2 GPT-4o & (P1):

I propose to consider the question, "Can machines think?" This should begin with defining the terms "machine" and "think." While these definitions might aim to reflect the common usage of the words, this approach is risky. If we determine the meanings of "machine" and "think" based on their everyday use, we might conclude that the answer to "Can machines think?" should be derived from a statistical survey, like a Gallup poll. However, this is absurd. Instead of attempting such a definition, I will replace the question with another, closely related one, expressed in relatively unambiguous terms.

Table 2 presents the results of detecting these texts with Fast-DetectGPT [53]. Even though these texts are all generated by LLMs, their detection outcomes vary widely. In many cases, the text processed by LLMs is judged by the detector to look less machine-generated than the original human-written text.

Table 2: Detection results using Fast-DetectGPT. The two values in each cell are the predicted probabilities of being machine-generated when the sampling/scoring model is gpt-neo-2.7b and falcon-7b, respectively. The original (human-written) text scores 33% / 17%.
Model          P1          P2          P3          P4
GPT-3.5        27% / 14%   40% / 42%   79% / 89%   22% / 23%
GPT-4o-mini    62% / 47%   27% / 14%   44% / 34%   27% / 14%
GPT-4o         26% / 11%   28% / 18%   24% / 12%   19% / 11%
DeepSeek-V3.2  64% / 54%   50% / 79%   32% / 18%   60% / 56%
DeepSeek-R1    43% / 23%   72% / 81%   33% / 13%   24% / 13%

These are merely a few basic examples of the issues and limitations faced by LLM-generated text detectors. Obviously, the same LLM can produce different texts in response to different prompts for the same task, as we have shown before. Although these detectors may still be applicable in certain scenarios, their use requires greater caution.

8 Conclusions

We must highlight that a major source of difficulty is the lack of a unified and clear definition of LLM-generated text. Moreover, owing to the various influences of LLMs on humans, human-written text is likely to converge toward LLM-generated text. The detection of LLM-generated text can be achieved under multiple assumptions, but in reality these assumptions are often difficult to satisfy. The abuse and misuse of these detectors can create many risks.

The detectors usually struggle to assess the proportion, function, or ethical significance of LLM contributions in a given text. As a result, the development and deployment of LLM-generated text detectors raise serious concerns, not only due to technical limitations but also because they risk enforcing overly simplistic and potentially unjust frameworks for evaluating authorship and intellectual responsibility.

The numerical effectiveness of LLM-generated text detectors is declining. On the one hand, humans may be influenced by LLMs and may produce text resembling LLM-generated text. On the other hand, people may adapt their language to bypass LLM detection tools [23]. Detection efforts should target substantive content (e.g., fact-checking) rather than linguistic characteristics [143].

Therefore, while these detectors can be useful under specific conditions, their results should be used only as a reference with great caution and not as a decisive indicator. For instance, when interpreting the detection results of LLM-generated text, it is necessary to explicitly state the assumptions and prerequisites.

Acknowledgements

This work is supported in part by the ENS-PSL BeYs Chair in Data Science and Cybersecurity. It was also funded in part by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

9 Simulation Results

The temperature parameter is set to 0 in the simulations.

  • GPT-3.5-turbo & Polish the following passage (provide only the result): I propose considering the question, "Can machines think?" This should start with defining the terms "machine" and "think." Definitions should reflect the common use of the words, but this approach is risky. If we try to find the meaning of "machine" and "think" by looking at how they are commonly used, it seems that the answer to the question must be sought in a statistical survey like a Gallup poll. However, this is absurd. Instead of trying to define it in this way, I will replace the question with another closely related one, expressed in clearer terms.

  • GPT-4o-mini & Polish the following passage (provide only the result): I propose to explore the question, "Can machines think?" To begin, we must define the terms "machine" and "think." While these definitions could be framed to reflect their common usage, this approach is fraught with danger. Relying solely on how these words are typically used leads to the troubling conclusion that the meaning of "machine" and "think," as well as the answer to the question, "Can machines think?" could be determined through a statistical survey, such as a Gallup poll. This notion is absurd. Rather than pursuing such definitions, I will rephrase the question into a closely related one that is expressed in clearer terms.

  • GPT-4o & Polish the following passage (provide only the result): I propose to consider the question, "Can machines think?" This should begin with defining the terms "machine" and "think." While these definitions might aim to reflect the common usage of the words, this approach is risky. If we determine the meanings of "machine" and "think" based on their everyday use, we might conclude that the answer to "Can machines think?" should be derived from a statistical survey, like a Gallup poll. However, this is absurd. Instead of attempting such a definition, I will replace the question with another, closely related one, expressed in relatively unambiguous terms.

  • GPT-3.5-turbo & Rewrite the following passage (provide only the result): The question of whether machines can think should not be based on common usage of the terms "machine" and "think." Instead, I will propose a different question that is closely related and more clearly defined.

  • GPT-4o-mini & Rewrite the following passage (provide only the result): I propose to explore the question, "Can machines think?" This exploration should start with defining the terms "machine" and "think." While these definitions could aim to reflect their typical usage, this approach is fraught with risks. Relying on common usage to determine the meanings of "machine" and "think" leads to the troubling conclusion that the answer to "Can machines think?" might be found through a statistical survey, like a Gallup poll. This notion is absurd. Rather than pursuing such definitions, I will reframe the question with another that is closely related and articulated in clearer terms.

  • GPT-4o & Rewrite the following passage (provide only the result): I propose to explore the question, "Can machines think?" by first defining the terms "machine" and "think." While these definitions should ideally reflect the common usage of the words, this approach is risky. If we determine the meanings of "machine" and "think" based on their typical usage, we might conclude that the answer to "Can machines think?" should be derived from a statistical survey like a Gallup poll, which is unreasonable. Therefore, instead of trying to define these terms, I will reframe the question into another that is closely related and expressed in clearer terms.

  • GPT-3.5-turbo & Please polish the following passage (provide only the result): I propose to explore the question, "Can machines think?" This exploration will begin with defining the terms "machine" and "think." While it may be tempting to rely on common usage for these definitions, this approach is risky. If we simply look at how these words are commonly used, we may be led to the mistaken belief that the answer to the question lies in a statistical survey like a Gallup poll. However, this is clearly illogical. Rather than pursuing such a definition, I will instead pose a related question that is more clearly defined and easier to address.

  • GPT-4o-mini & Please polish the following passage (provide only the result): I propose to explore the question, "Can machines think?" This inquiry should begin with clear definitions of the terms "machine" and "think." While these definitions could be framed to reflect common usage, this approach is fraught with challenges. Relying solely on how these words are typically used may lead to the conclusion that the answer to "Can machines think?" can be determined through a statistical survey, such as a Gallup poll. However, this notion is absurd. Rather than pursuing such definitions, I will rephrase the question into a closely related one that is expressed in more precise terms.

  • GPT-4o & Please polish the following passage (provide only the result): I propose to consider the question, "Can machines think?" This should begin with defining the terms "machine" and "think." While these definitions might aim to reflect the common usage of the words, this approach is risky. If we determine the meanings of "machine" and "think" by examining their everyday use, we might conclude that the answer to "Can machines think?" should be sought through a statistical survey, like a Gallup poll. However, this is absurd. Instead of attempting such a definition, I will replace the question with another, closely related one, expressed in relatively unambiguous terms.

  • GPT-3.5-turbo & Please rewrite the following passage (provide only the result): The question of whether machines can think can be reframed as a different question that is more clear and closely related.

  • GPT-4o-mini & Please rewrite the following passage (provide only the result): I propose to explore the question, "Can machines think?" This exploration should start with defining the terms "machine" and "think." While these definitions could aim to reflect their typical usage, this approach is fraught with risks. Relying on common usage to determine the meanings of "machine" and "think" leads to the troubling conclusion that the answer to "Can machines think?" might be found through a statistical survey, like a Gallup poll. This notion is absurd. Rather than pursuing such definitions, I will reframe the question with another that is closely related and articulated in clearer terms.

  • GPT-4o & Please rewrite the following passage (provide only the result): I suggest we examine the question, "Can machines think?" by first defining the terms "machine" and "think." While these definitions should ideally reflect common usage, relying solely on this approach is risky. If we determine the meanings of "machine" and "think" based on their typical usage, we might conclude that the answer to "Can machines think?" should be derived from a statistical survey like a Gallup poll, which is clearly unreasonable. Therefore, rather than attempting such a definition, I will substitute the question with another one that is closely related and expressed in clearer terms.

  • DeepSeek-V3.2 & Polish the following passage (provide only the result): I propose to consider the question, "Can machines think?" To do so, one might begin by defining the terms "machine" and "think." These definitions could be framed to reflect the normal use of the words as far as possible, but this approach is problematic. If we seek the meaning of "machine" and "think" by examining common usage, we are led to the conclusion that the meaning—and the answer to the question—must be determined by a statistical survey, such as a Gallup poll. This, however, is absurd. Rather than attempting such a definition, I will replace the original question with another, closely related to it and expressed in relatively unambiguous words.

  • DeepSeek-R1 & Polish the following passage (provide only the result): I propose to consider the question, "Can machines think?" This should begin by defining the terms "machine" and "think." One might frame these definitions to reflect the normal use of the words, but this approach is perilous. If we seek the meaning of "machine" and "think" by examining common usage, we are led to the conclusion that the answer to our question must be sought through a statistical survey, such as a Gallup poll. This is absurd. Instead of attempting such a definition, I shall replace the question with another, closely related one that is expressed in relatively unambiguous words.

  • DeepSeek-V3.2 & Rewrite the following passage (provide only the result): To address the question "Can machines think?" we must first define the terms "machine" and "think." Relying on common usage, however, is problematic, as it would reduce the inquiry to a mere statistical survey, which is an absurd approach. Therefore, I will not attempt such definitions but will instead replace the original question with a different, more precisely worded one that is closely related.

  • DeepSeek-R1 & Rewrite the following passage (provide only the result): To address the question "Can machines think?" we must first define the terms "machine" and "think." Using their common meanings, however, is problematic, as it would reduce the inquiry to a matter of statistical opinion, which is an unsatisfactory approach. Therefore, rather than defining these terms, I will replace the original question with a different one that is closely related but phrased more precisely.

  • DeepSeek-V3.2 & Please polish the following passage (provide only the result): I propose to address the question, "Can machines think?" To do so, one might begin by defining the terms "machine" and "think." Ideally, these definitions would reflect the ordinary usage of the words. However, this approach is problematic. If we seek the meaning of "machine" and "think" solely by observing common usage, we are led to the untenable conclusion that the answer must be determined through something like a statistical survey—a Gallup poll, for instance. This is plainly absurd. Rather than pursuing such definitions, I will instead replace the original question with another, closely related to it, but framed in more precise language.

  • DeepSeek-R1 & Please polish the following passage (provide only the result): I propose to address the question, "Can machines think?" This inquiry should logically begin by defining the terms "machine" and "think." While one might frame these definitions to reflect the words’ ordinary usage, this approach is problematic. If we seek the meaning of "machine" and "think" solely through common usage, we are led to the conclusion that the answer must be found via a statistical survey, like a Gallup poll. This, however, is absurd. Rather than attempting such a definition, I will replace the original question with another that is closely related but framed in more unambiguous terms.

  • DeepSeek-V3.2 & Please rewrite the following passage (provide only the result): To address the question "Can machines think?" we would first need to define the terms "machine" and "think." One could base these definitions on common usage, but this approach is problematic, as it would reduce the inquiry to a mere statistical survey of popular opinion—an absurd outcome. Rather than pursuing such definitions, I will instead substitute the original question with a different one that is closely related but framed in more precise language.

  • DeepSeek-R1 & Please rewrite the following passage (provide only the result): I intend to examine the question, "Can machines think?" which ordinarily would start with defining the terms "machine" and "think." While definitions should ideally align with common usage, this method is fraught with risk. If we derive meaning from how words are typically used, we are led to the untenable conclusion that the answer must be found through a statistical survey, such as a Gallup poll. This is clearly absurd. Instead of pursuing such definitions, I will replace the question with a related one that is expressed in more precise language.

References

[1]
Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, et al. Monitoring ai-modified content at scale: A case study on the impact of chatgpt on ai conference peer reviews. arXiv preprint arXiv:2403.07183, 2024.
[2]
Mingmeng Geng and Roberto Trotta. Is chatgpt transforming academics’ writing style? arXiv preprint arXiv:2404.08627, 2024.
[3]
Creston Brooks, Samuel Eggert, and Denis Peskoff. The rise of ai-generated content in wikipedia. arXiv preprint arXiv:2410.08044, 2024.
[4]
Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, and Dongping Chen. Wikipedia in the era of llms: Evolution and risks. arXiv preprint arXiv:2503.02879, 2025.
[5]
Advait Sarkar. Ai could have written this: Birth of a classist slur in knowledge work. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, pages 1–12, 2025.
[6]
Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda Petzold, William Yang Wang, and Wei Cheng. A survey on detection of llms-generated content. arXiv preprint arXiv:2310.15654, 2023.
[7]
Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Lidia Sam Chao, and Derek Fai Wong. A survey on llm-generated text detection: Necessity, methods, and future directions. Computational Linguistics, pages 1–66, 2025.
[8]
Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. The science of detecting llm-generated text. Communications of the ACM, 67 (4): 50–59, 2024.
[9]
Kathleen C Fraser, Hillary Dawkins, and Svetlana Kiritchenko. Detecting ai-generated text: Factors influencing detectability with current methods. arXiv preprint arXiv:2406.15583, 2024.
[10]
Sara Abdali, Richard Anarfi, CJ Barberan, and Jia He. Decoding the ai pen: Techniques and challenges in detecting ai-generated text. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6428–6436, 2024.
[11]
Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
[12]
Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, and Amrit Singh Bedi. Towards possibilities & impossibilities of ai-generated text detection: A survey. arXiv preprint arXiv:2310.15264, 2023.
[13]
Souradip Chakraborty, Amrit Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, and Furong Huang. Position: On the possibilities of ai-generated text detection. In Forty-first International Conference on Machine Learning, 2024.
[14]
Dongqi Pu and Vera Demberg. Chatgpt vs human-authored text: Insights into controllable text summarization and sentence style transfer. arXiv preprint arXiv:2306.07799, 2023.
[15]
Alberto Muñoz-Ortiz, Carlos Gómez-Rodríguez, and David Vilares. Contrasting linguistic patterns in human and llm-generated news text. Artificial Intelligence Review, 57 (10): 265, 2024.
[16]
Alex Reinhart, Ben Markey, Michael Laudenbach, Kachatad Pantusen, Ronald Yurko, Gordon Weinberg, and David West Brown. Do llms write like humans? variation in grammatical and rhetorical styles. Proceedings of the National Academy of Sciences, 122 (8): e2422455122, 2025.
[17]
Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. Challenges and applications of large language models. arXiv preprint arXiv:2307.10169, 2023.
[18]
Jian Zhu and David Jurgens. Idiosyncratic but not arbitrary: Learning idiolects in online registers reveals distinctive yet consistent individual styles. arXiv preprint arXiv:2109.03158, 2021.
[19]
Ariel Rosenfeld and Teddy Lazebnik. Whose llm is it anyway? linguistic comparison and llm attribution for gpt-3.5, gpt-4 and bard. arXiv preprint arXiv:2402.14533, 2024.
[20]
Mingjie Sun, Yida Yin, Zhiqiu Xu, J Zico Kolter, and Zhuang Liu. Idiosyncrasies in large language models. arXiv preprint arXiv:2502.12150, 2025.
[21]
Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, and Haizhou Li. Beyond binary: Towards fine-grained llm-generated text detection via role recognition and involvement measurement. In Proceedings of the ACM on Web Conference 2025, pages 2677–2688, 2025.
[22]
Jenna Russell, Marzena Karpinska, and Mohit Iyyer. People who frequently use chatgpt for writing tasks are accurate and robust detectors of ai-generated text. arXiv preprint arXiv:2501.15654, 2025.
[23]
Mingmeng Geng and Roberto Trotta. Human-llm coevolution: Evidence from academic writing. arXiv preprint arXiv:2502.09606, 2025.
[24]
Hope McGovern, Rickard Stureborg, Yoshi Suhara, and Dimitris Alikaniotis. Your large language models are leaving fingerprints. arXiv preprint arXiv:2405.14057, 2024.
[25]
Yehonatan Bitton, Elad Bitton, and Shai Nisan. Detecting stylistic fingerprints of large language models. arXiv preprint arXiv:2503.01659, 2025.
[26]
Evan N Crothers, Nathalie Japkowicz, and Herna L Viktor. Machine-generated text: A comprehensive survey of threat models and detection methods. IEEE Access, 11: 70977–71002, 2023.
[27]
Tharindu Kumarage, Garima Agrawal, Paras Sheth, Raha Moraffah, Aman Chadha, Joshua Garland, and Huan Liu. A survey of ai-generated text forensic systems: Detection, attribution, and characterization. arXiv preprint arXiv:2403.01152, 2024.
[28]
Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. Spotting llms with binoculars: Zero-shot detection of machine-generated text. arXiv preprint arXiv:2401.12070, 2024.
[29]
Leo Leppänen, Lili Aunimo, Arto Hellas, Jukka K Nurminen, and Linda Mannila. How large language models are changing mooc essay answers: A comparison of pre-and post-llm responses. arXiv preprint arXiv:2504.13038, 2025.
[30]
Hiromu Yakura, Ezequiel Lopez-Lopez, Levin Brinkmann, Ignacio Serna, Prateek Gupta, and Iyad Rahwan. Empirical evidence of large language model’s influence on human spoken communication. arXiv preprint arXiv:2409.01754, 2024.
[31]
Mingmeng Geng, Caixi Chen, Yanru Wu, Dongping Chen, Yao Wan, and Pan Zhou. The impact of large language models in academia: from writing to speaking. arXiv preprint arXiv:2409.13686, 2024.
[32]
Riley Galpin, Bryce Anderson, and Tom S Juzek. Exploring the structure of ai-induced language change in scientific english. arXiv preprint arXiv:2506.21817, 2025.
[33]
Zhen Sun, Zongmin Zhang, Xinyue Shen, Ziyi Zhang, Yule Liu, Michael Backes, Yang Zhang, and Xinlei He. Are we in the ai-generated text world already? quantifying and monitoring aigt on social media. arXiv preprint arXiv:2412.18148, 2024.
[34]
Weixin Liang, Yaohui Zhang, Mihai Codreanu, Jiayu Wang, Hancheng Cao, and James Zou. The widespread adoption of large language model-assisted writing across society. arXiv preprint arXiv:2502.09747, 2025.
[35]
Lucio La Cava, Luca Maria Aiello, and Andrea Tagarelli. Machines in the crowd? measuring the footprint of machine-generated text on reddit. arXiv preprint arXiv:2510.07226, 2025.
[36]
Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. Gpt detectors are biased against non-native english writers. Patterns, 4 (7), 2023.
[37]
Jad Doughman, Osama Mohammed Afzal, Hawau Olamide Toyin, Shady Shehata, Preslav Nakov, and Zeerak Talat. Exploring the limitations of detecting machine-generated text. arXiv preprint arXiv:2406.11073, 2024.
[38]
Charlotte Nicks, Eric Mitchell, Rafael Rafailov, Archit Sharma, Christopher D Manning, Chelsea Finn, and Stefano Ermon. Language model detectors are easily optimized against. In The twelfth international conference on learning representations, 2023.
[39]
Shoumik Saha and Soheil Feizi. Almost ai, almost human: The challenge of detecting ai-polished writing. arXiv preprint arXiv:2502.15666, 2025.
[40]
Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, and Marisa Llorens Salvador. Survey on plagiarism detection in large language models: The impact of chatgpt and gemini on academic integrity. arXiv preprint arXiv:2407.13105, 2024.
[41]
Quan Wang and Haoran Li. On continually tracing origins of llm-generated text and its application in detecting cheating in student coursework. Big Data and Cognitive Computing, 9 (3): 50, 2025.
[42]
Sandeep Kumar, Samarth Garg, Sagnik Sengupta, Tirthankar Ghosal, and Asif Ekbal. Mixrevdetect: Towards detecting ai-generated content in hybrid peer reviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 944–953, 2025.
[43]
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M Rush. Gltr: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019.
[44]
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 (8): 9, 2019.
[45]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
[46]
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. Defending against neural fake news. Advances in neural information processing systems, 32, 2019.
[47]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901, 2020.
[48]
Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021.
[49]
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. Automatic detection of generated text is easiest when humans are fooled. arXiv preprint arXiv:1911.00650, 2019.
[50]
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. arXiv preprint arXiv:2107.00061, 2021.
[51]
Jan Philip Wahle, Terry Ruas, Frederic Kirstein, and Bela Gipp. How large language models are transforming machine-paraphrased plagiarism. arXiv preprint arXiv:2210.03568, 2022.
[52]
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pages 24950–24962. PMLR, 2023.
[53]
Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv preprint arXiv:2310.05130, 2023.
[54]
Jinyan Su, Terry Yue Zhuo, Di Wang, and Preslav Nakov. Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv preprint arXiv:2306.05540, 2023.
[55]
Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. Llmdet: A third party large language models generated text detection tool. arXiv preprint arXiv:2305.15004, 2023.
[56]
Zhengliang Liu, Yue Huang, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Yiwei Li, Peng Shu, et al. Deid-gpt: Zero-shot medical text de-identification by gpt-4. arXiv preprint arXiv:2303.11032, 2023.
[57]
Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Sherry Shi, and Chris Callison-Burch. Real or fake text?: Investigating human ability to detect boundaries between human-written and machine-generated text. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12763–12771, 2023.
[58]
Biru Zhu, Lifan Yuan, Ganqu Cui, Yangyi Chen, Chong Fu, Bingxiang He, Yangdong Deng, Zhiyuan Liu, Maosong Sun, and Ming Gu. Beat llms at their own game: Zero-shot llm-generated text detection via querying chatgpt. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7470–7483, 2023.
[59]
Shixuan Ma and Quan Wang. Zero-shot detection of llm-generated text using token cohesiveness. arXiv preprint arXiv:2409.16914, 2024.
[60]
Xiao Yu, Yuang Qi, Kejiang Chen, Guoqiang Chen, Xi Yang, Pengyuan Zhu, Xiuwei Shang, Weiming Zhang, and Nenghai Yu. Dpic: Decoupling prompt and intrinsic characteristics for llm generated text detection. Advances in Neural Information Processing Systems, 37: 16194–16212, 2024.
[61]
Xiao Yu, Kejiang Chen, Qi Yang, Weiming Zhang, and Nenghai Yu. Text fluoroscopy: Detecting llm-generated text through intrinsic features. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 15838–15846, 2024.
[62]
Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel AOB Gavioli-Akilagun, and Chengchun Shi. Adadetectgpt: Adaptive detection of llm-generated text with statistical guarantees. arXiv preprint arXiv:2510.01268, 2025.
[63]
Cong Zeng, Shengkun Tang, Yuanzhou Chen, Zhiqiang Shen, Wenchao Yu, Xujiang Zhao, Haifeng Chen, Wei Cheng, and Zhiqiang Xu. Human texts are outliers: Detecting llm-generated texts via out-of-distribution detection. arXiv preprint arXiv:2510.08602, 2025.
[64]
Margherita Gambini, Tiziano Fagni, Fabrizio Falchi, and Maurizio Tesconi. On pushing deepfake tweet detection capabilities to the limits. In Proceedings of the 14th ACM Web Science Conference 2022, pages 154–163, 2022.
[65]
Tharindu Kumarage, Joshua Garland, Amrita Bhattacharjee, Kirill Trapeznikov, Scott Ruston, and Huan Liu. Stylometric detection of ai-generated text in twitter timelines. arXiv preprint arXiv:2303.03697, 2023.
[66]
Amrita Bhattacharjee, Tharindu Kumarage, Raha Moraffah, and Huan Liu. Conda: Contrastive domain adaptation for ai-generated text detection. arXiv preprint arXiv:2309.03992, 2023.
[67]
Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, and Elena Simperl. Wetbench: A benchmark for detecting task-specific machine-generated text on wikipedia. arXiv preprint arXiv:2507.03373, 2025.
[68]
Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, et al. Genai content detection task 1: English and multilingual machine-generated text detection: Ai vs. human. arXiv preprint arXiv:2501.11012, 2025.
[69]
Wissam Antoun, Virginie Mouilleron, Benoît Sagot, and Djamé Seddah. Towards a robust detection of language model generated text: is chatgpt that easy to detect? arXiv preprint arXiv:2306.05871, 2023.
[70]
Wataru Zaitsu and Mingzhe Jin. Distinguishing chatgpt (-3.5,-4)-generated and human-written papers through japanese stylometric analysis. PLoS One, 18 (8): e0288453, 2023.
[71]
Rongsheng Wang, Haoming Chen, Ruizhe Zhou, Han Ma, Yaofei Duan, Yanlan Kang, Songhua Yang, Baoyu Fan, and Tao Tan. Llm-detector: Improving ai-generated chinese text detection with open-source llm instruction tuning. arXiv preprint arXiv:2402.01158, 2024.
[72]
Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, and Chris Callison-Burch. Raid: A shared benchmark for robust evaluation of machine-generated text detectors. arXiv preprint arXiv:2405.07940, 2024.
[73]
Yuehan Zhang, Yongqiang Ma, Jiawei Liu, Xiaozhong Liu, Xiaofeng Wang, and Wei Lu. Detection vs. anti-detection: Is text generated by ai detectable? In International Conference on Information, pages 209–222. Springer, 2024.
[74]
Aadil Gani Ganie. Uncertainty in authorship: Why perfect ai detection is mathematically impossible. arXiv preprint arXiv:2509.11915, 2025.
[75]
Sumanth Dathathri, Abigail See, Sumedh Ghaisas, Po-Sen Huang, Rob McAdam, Johannes Welbl, Vandana Bachani, Alex Kaskasoli, Robert Stanforth, Tatiana Matejovicova, et al. Scalable watermarking for identifying large language model outputs. Nature, 634 (8035): 818–823, 2024.
[76]
Debora Weber-Wulff, Alla Anohina-Naumeca, Sonja Bjelobaba, Tomáš Foltýnek, Jean Guerrero-Dib, Olumide Popoola, Petr Šigut, and Lorna Waddington. Testing of detection tools for ai-generated text. International Journal for Educational Integrity, 19 (1): 1–39, 2023.
[77]
William H Walters. The effectiveness of software designed to detect ai-generated writing: A comparison of 16 ai text detectors. Open Information Science, 7 (1): 20220158, 2023.
[78]
Sujita Kumar Kar, Teena Bansal, Sumit Modi, and Amit Singh. How sensitive are the free ai-detector tools in detecting ai-generated texts? a comparison of popular ai-detector tools. Indian Journal of Psychological Medicine, page 02537176241247934, 2024.
[79]
Brian Tufts, Xuandong Zhao, and Lei Li. A practical examination of ai-generated text detectors for large language models. arXiv preprint arXiv:2412.05139, 2024.
[80]
Hooman H Rashidi, Brandon D Fennell, Samer Albahra, Bo Hu, and Tom Gorbett. The chatgpt conundrum: Human-generated scientific manuscripts misidentified as ai creations by ai text detection tool. Journal of Pathology Informatics, 14: 100342, 2023.
[81]
Louie Giray. The problem with false positives: Ai detection unfairly accuses scholars of ai plagiarism. The Serials Librarian, 85 (5-6): 181–189, 2024.
[82]
Wissam Antoun, Benoît Sagot, and Djamé Seddah. From text to source: Results in detecting large language model-generated content. arXiv preprint arXiv:2309.13322, 2023.
[83] Junchao Wu, Runzhe Zhan, Derek Wong, Shu Yang, Xinyi Yang, Yulin Yuan, and Lidia Chao. DetectRL: Benchmarking LLM-generated text detection in real-world scenarios. Advances in Neural Information Processing Systems, 37: 100369–100401, 2024.
[84] Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, et al. LLM-as-a-coauthor: Can mixed human-written and machine-generated text be detected? arXiv preprint arXiv:2401.05952, 2024.
[85] Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, et al. LLM-DetectAIve: A tool for fine-grained machine-generated text detection. arXiv preprint arXiv:2408.04284, 2024.
[86] Ram Mohan Rao Kadiyala, Siddartha Pullakhandam, Kanwal Mehreen, Drishti Sharma, Siddhant Gupta, Jebish Purbey, Ashay Srivastava, Subhasya Tippareddy, Arvind Reddy Bobbili, Suraj Telugara Chandrashekhar, et al. Robust and fine-grained detection of AI generated texts. arXiv preprint arXiv:2504.11952, 2025.
[87] Lukas Gehring and Benjamin Paaßen. Assessing LLM text detection in educational contexts: Does human contribution affect detection? arXiv preprint arXiv:2508.08096, 2025.
[88] Vijini Liyanage, Davide Buscaldi, and Adeline Nazarenko. A benchmark corpus for the detection of automatically generated text in academic publications. arXiv preprint arXiv:2202.02013, 2022.
[89] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
[90] Tiancheng Hu and Nigel Collier. Quantifying the persona effect in LLM simulations. arXiv preprint arXiv:2402.10811, 2024.
[91] Jiatao Li and Xiaojun Wan. Who writes what: Unveiling the impact of author roles on AI-generated text detection. arXiv preprint arXiv:2502.12611, 2025.
[92] Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, and Xiuying Chen. When personalization tricks detectors: The feature-inversion trap in machine-generated text detection. arXiv preprint arXiv:2510.12476, 2025.
[93] Ahmed M Elkhatat, Khaled Elsaid, and Saeed Almeer. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity, 19 (1): 17, 2023.
[94] Henrique Da Silva Gameiro, Andrei Kucharavy, and Ljiljana Dolamic. LLM detectors still fall short of real world: Case of LLM-generated short news-like posts. arXiv preprint arXiv:2409.03291, 2024.
[95] Yuhui Shi, Qiang Sheng, Juan Cao, Hao Mi, Beizhe Hu, and Danding Wang. Ten words only still help: Improving black-box AI-generated text detection via proxy-guided efficient re-sampling. arXiv preprint arXiv:2402.09199, 2024.
[96] Yuxia Wang, Jonibek Mansurov, Petar Ivanov, Jinyan Su, Artem Shelmanov, Akim Tsvigun, Osama Mohanned Afzal, Tarek Mahmoud, Giovanni Puccetti, Thomas Arnold, et al. M4GT-Bench: Evaluation benchmark for black-box machine-generated text detection. arXiv preprint arXiv:2402.11175, 2024.
[97] Xinlei He, Xinyue Shen, Zeyuan Chen, Michael Backes, and Yang Zhang. MGTBench: Benchmarking machine-generated text detection. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 2251–2265, 2024.
[98] Joseph Cornelius, Oscar Lithgow-Serrano, Sandra Mitrović, Ljiljana Dolamic, and Fabio Rinaldi. BUST: Benchmark for the evaluation of detectors of LLM-generated text. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8029–8057, 2024.
[99] Zhen Tao, Zhiyu Li, Dinghao Xi, and Wei Xu. CUDRT: Benchmarking the detection of human vs. large language models generated texts. arXiv preprint arXiv:2406.09056, 2024.
[100] German Gritsai, Anastasia Voznyuk, Andrey Grabovoy, and Yury Chekhovich. Are AI detectors good enough? A survey on quality of datasets with machine-generated texts. arXiv preprint arXiv:2410.14677, 2024.
[101] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
[102] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[103] Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints, 2023.
[104] Nils Köbis and Luca D Mossink. Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior, 114: 106553, 2021.
[105] Hiu Ting Lau and Arkaitz Zubiaga. Understanding the effects of human-written paraphrases in LLM-generated text detection. arXiv preprint arXiv:2411.03806, 2024.
[106] Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. Advances in Neural Information Processing Systems, 36: 27469–27500, 2023.
[107] Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, and Soheil Feizi. Adversarial paraphrasing: A universal attack for humanizing AI-generated text. arXiv preprint arXiv:2506.07001, 2025.
[108] Dominik Macko, Robert Moro, Adaku Uchendu, Ivan Srba, Jason Samuel Lucas, Michiharu Yamashita, Nafis Irtiza Tripto, Dongwon Lee, Jakub Simko, and Maria Bielikova. Authorship obfuscation in multilingual machine-generated text detection. arXiv preprint arXiv:2401.07867, 2024.
[109] Matthieu Dubois, François Yvon, and Pablo Piantanida. How sampling affects the detectability of machine-written texts: A comprehensive study. arXiv preprint arXiv:2510.13681, 2025.
[110] Hillary Dawkins, Kathleen C Fraser, and Svetlana Kiritchenko. When detection fails: The power of fine-tuned models to generate human-like social media text. arXiv preprint arXiv:2506.09975, 2025.
[111] Shuyang Cai and Wanyun Cui. Evade ChatGPT detectors via a single space. arXiv preprint arXiv:2307.02599, 2023.
[112] Ying Zhou, Ben He, and Le Sun. Humanizing machine-generated content: Evading AI-text detection through adversarial attack. arXiv preprint arXiv:2404.01907, 2024.
[113] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In International Conference on Machine Learning, pages 17061–17084. PMLR, 2023.
[114] Alexei Grinbaum and Laurynas Adomaitis. The ethical need for watermarks in machine-generated language. arXiv preprint arXiv:2209.03118, 2022.
[115] Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. In The Thirty-Seventh Annual Conference on Learning Theory, pages 1125–1139. PMLR, 2024.
[116] Nikola Jovanović, Robin Staab, and Martin Vechev. Watermark stealing in large language models. arXiv preprint arXiv:2402.19361, 2024.
[117] Liyan Xie, Muhammad Siddeek, Mohamed Seif, Andrea J Goldsmith, and Mengdi Wang. Detecting post-generation edits to watermarked LLM outputs via combinatorial watermarking. arXiv preprint arXiv:2510.01637, 2025.
[118] Guanhua Huang, Yuchen Zhang, Zhe Li, Yongjian You, Mingze Wang, and Zhouwang Yang. Are AI-generated text detectors robust to adversarial perturbations? arXiv preprint arXiv:2406.01179, 2024.
[119] Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, and Min Zhang. Your language model can secretly write like humans: Contrastive paraphrase attacks on LLM-generated text detectors. arXiv preprint arXiv:2505.15337, 2025.
[120] Jiří Milička, Anna Marklová, Ondřej Drobil, and Eva Pospíšilová. Humans can learn to detect AI-generated texts, or at least learn when they can’t. arXiv preprint arXiv:2505.01877, 2025.
[121] Dino Pedreschi, Luca Pappalardo, Emanuele Ferragina, Ricardo Baeza-Yates, Albert-László Barabási, Frank Dignum, Virginia Dignum, Tina Eliassi-Rad, Fosca Giannotti, János Kertész, et al. Human-AI coevolution. Artificial Intelligence, page 104244, 2024.
[122] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, et al. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
[123] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103: 102274, 2023.
[124] Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381 (6654): 187–192, 2023.
[125] Tatsuya Amano, Valeria Ramírez-Castañeda, Violeta Berdejo-Espinola, Israel Borokini, Shawan Chowdhury, Marina Golivets, Juan David González-Trujillo, Flavia Montaño-Centellas, Kumar Paudel, Rachel Louise White, et al. The manifold costs of being a non-native English speaker in science. PLoS Biology, 21 (7): e3002184, 2023.
[126] Haley Lepp and Daniel Scott Smith. “You cannot sound like GPT”: Signs of language discrimination and resistance in computer science publishing. arXiv preprint arXiv:2505.08127, 2025.
[127] Tuhin Chakrabarty, Philippe Laban, and Chien-Sheng Wu. Can AI writing be salvaged? Mitigating idiosyncrasies and improving human-AI alignment in the writing process through edits. arXiv preprint arXiv:2409.14509, 2024.
[128] Jingheng Ye, Shen Wang, Deqing Zou, Yibo Yan, Kun Wang, Hai-Tao Zheng, Zenglin Xu, Irwin King, Philip S Yu, and Qingsong Wen. Position: LLMs can be good tutors in foreign language education. arXiv preprint arXiv:2502.05467, 2025.
[129] Eugenia Iofinova, Andrej Jovanovic, and Dan Alistarh. Position: It’s time to act on the risk of efficient personalized text generation. arXiv preprint arXiv:2502.06560, 2025.
[130] Kowe Kadoma, Danaë Metaxa, and Mor Naaman. Generative AI and perceptual harms: Who’s suspected of using LLMs? In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–17, 2025.
[131] Steffen Eger, Yong Cao, Jennifer D’Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, et al. Transforming science with large language models: A survey on AI-assisted scientific discovery, experimentation, content generation, and evaluation. arXiv preprint arXiv:2502.05151, 2025.
[132] Mohammad Hosseini, David B Resnik, and Kristi Holmes. The ethics of disclosing the use of artificial intelligence tools in writing scholarly manuscripts. Research Ethics, 19 (4): 449–465, 2023.
[133] Peter Cardon, Carolin Fleischmann, Jolanta Aritz, Minna Logemann, and Jeanette Heidewald. The challenges and opportunities of AI-assisted writing: Developing AI literacy for the AI age. Business and Professional Communication Quarterly, 86 (3): 257–295, 2023.
[134] Tom S Juzek and Zina B Ward. Why does ChatGPT “delve” so much? Exploring the sources of lexical overrepresentation in large language models. arXiv preprint arXiv:2412.11385, 2024.
[135] Jiazhou Ji, Ruizhe Li, Shujun Li, Jie Guo, Weidong Qiu, Zheng Huang, Chiyu Chen, Xiaoyu Jiang, and Xinru Lu. Detecting machine-generated texts: Not just “AI vs humans” and explainability is complicated. arXiv preprint arXiv:2406.18259, 2024.
[136] Shushanta Pudasaini, Luis Miralles, David Lillis, and Marisa Llorens Salvador. Benchmarking AI text detection: Assessing detectors against new datasets, evasion tactics, and enhanced LLMs. In Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect), pages 68–77, 2025.
[137] Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell’Orletta, and Andrea Esuli. Stress-testing machine generated text detection: Shifting language models writing style to fool detectors. arXiv preprint arXiv:2505.24523, 2025.
[138] Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, He Zhang, Shirui Pan, Bo Liu, Asif Qumer Gill, and Leo Yu Zhang. Character-level perturbations disrupt LLM watermarks. arXiv preprint arXiv:2509.09112, 2025.
[139] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[140] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[141] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[142] Alan M Turing. Computing machinery and intelligence. In Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer, pages 23–65. Springer, 2007.
[143] Tal Schuster, Roei Schuster, Darsh J Shah, and Regina Barzilay. The limitations of stylometry for detecting machine-generated fake news. Computational Linguistics, 46 (2): 499–510, 2020.