Introduction

I like to start my posts by clearly stating the questions that drove me to write them; it’s always good to keep the motivation in mind. So here is this time’s question: given that LLM detectors exist, and that many of them claim reasonable accuracy, surely we can train an LLM to beat those detectors, right? And if so, does that mean we can train an LLM to be human-like? Let’s start by taking a look at the theoretical difference between machine-generated and human-generated text.

Machine-generated vs Human-generated text

I assume that you, my dear reader, are familiar with how LLMs work, specifically auto-regressive models like GPT-3 (there are masked models and even diffusion text models, but we don’t talk about those). At their core, these are all statistical models, trained to predict the probability of the next word given a context. For all its magical effectiveness, this approach is also these models’ biggest weakness. I’m not going to go into whether or not these models understand language, that’s a different discussion, but I think no one would argue that this is how we humans write text. You don’t just write “fish” because you saw the words “in” and “sea” earlier in the sentence, obviously.

A chart of perceived probability curves in a human text (right) vs. a machine text (Mitchell et al., 2023).

Current LLM detectors

In general, these detectors fall into four categories: those that calculate perplexity directly and judge based on it (GPTZero), those that use other language models to judge the text (Ghostbuster; Verma et al., 2023), those that learn from examples of human-written and machine-generated text (Guo et al., 2023), and those that combine several of these approaches (Binoculars; Hans et al., 2024). Detectors in the last category are the best I have tried so far.
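As a rough illustration of what the perplexity-based family measures, here is a small sketch (not GPTZero’s actual implementation): score a text by how predictable it is under a reference language model, where unusually low perplexity hints that the text came from a model.

```python
# Sketch of a perplexity score under a reference LM (illustrative, not any detector's real code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # With labels == inputs, the loss is the mean negative log-likelihood of each next token.
    loss = lm(ids, labels=ids).loss
    return float(torch.exp(loss))

# Lower perplexity means the text is more "expected" by the LM, and therefore more suspicious.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```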

However, no matter how good these detectors are, they are not only easy to fool but also make a lot of mistakes. You have probably already seen a story about a student unfairly accused of plagiarism because of a false positive from a detector. The probability of being flagged as machine-generated is, for example, higher if the text is written by a non-native speaker (probably because the text is less smooth or natural) and lower if the text contains errors (Perkins et al., 2024). The consensus is that these detectors are not meant to be used to judge people’s work, and are not reliable enough for that.

Training a human-like LLM

Despite their shortcomings, one could argue that these detectors still capture some of the characteristics of human text. Testing a few of the aforementioned detectors with a simple tool I made, on ~1000 texts generated by Llama-3-8B, the detectors correctly identified the texts as machine-generated more than 95% of the time. This is of course a very rough test, but it shows that these detectors work and gives us a good setup for our experiment.

1st Idea: PPO Style RL Training

The first thing we can try is to fine-tune a model to directly maximize the probability of being classified as human-generated by a detector. The reason I’m saying directly, despite PPO standing for Proximal Policy Optimization, will become clear in a moment. In this setup, our reward model is the detector itself, and the reward function is something like $r = 1 - p_d(x)$, where $p_d(x)$ is the probability that the text is machine-generated as evaluated by the detector $d$, and $x$ is the generated text.
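As a concrete sketch of that reward, assuming the detector is exposed as a Hugging Face sequence classifier (the checkpoint name and the label index below are placeholders, not my exact setup):

```python
# Sketch of r = 1 - p_d(x) with a hypothetical HF sequence-classification detector.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

DETECTOR = "some-org/ai-text-detector"  # placeholder model id
det_tok = AutoTokenizer.from_pretrained(DETECTOR)
detector = AutoModelForSequenceClassification.from_pretrained(DETECTOR)

@torch.no_grad()
def human_reward(texts: list[str]) -> list[float]:
    batch = det_tok(texts, padding=True, truncation=True, return_tensors="pt")
    probs = detector(**batch).logits.softmax(dim=-1)
    p_machine = probs[:, 1]            # assume index 1 == "machine-generated"
    return (1.0 - p_machine).tolist()  # higher reward = looks more human
```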

This is the same method used to fine-tune GPT-3 on human feedback (Ouyang et al., 2022). An interesting question here would be whether a model trained against one detector is also robust against other detectors.

For this to work, we need to avoid the following immediate problems:

  1. The model generates short texts. This is a problem because short texts (under 100–150 words) don’t give a detector enough signal to make a good judgement, if it can work with them at all.
  2. The model generates nonsense. The problem here is that the detector will be led to believe that the text is random, and therefore more likely to have been generated by a human rather than a guided statistical process.

To solve the first issue we can either force the model to generate longer texts or ask for length in the system prompt. This, however, makes the outputs more nonsensical and leads straight into the second issue. A better solution is to pick prompts that require a lengthy response and ideally leave room for creativity. To that end, I gathered a list of prompts from various datasets according to the following criteria:

  • Only English.
  • Only creative writing prompts or similar (no coding or math).
  • Prompts can’t be answered adequately in less than 100 words.
  • Responses are rated well by feedback/reward models.

I dub this dataset LONG, since its purpose is eliciting long responses from LLMs. The dataset and further details are on the huggingface page here. With this dataset, we can run our first experiment! A small model, Gemma-2B in this case, suffices for a first run. Sure enough, the model learns to beat the detector! It’s a success! Sort of.

A table of some of the responses generated by the trained policy and by the reference model, along with their respective rewards

As you can see in the table above, the model learns to cheat the detector by outputting only short or nonsensical texts, disregarding the query. This happens quickly too, in about 5 steps. I tried different forms of the reward function that incorporate a feedback model to punish bad responses, things like

$$ r = \min\big(\mathrm{score}_{\mathrm{human}}(x),\ \mathrm{score}_{\mathrm{feedback}}(x)\big) $$

$$ r = w \cdot \mathrm{score}_{\mathrm{human}}(x) + (1 - w) \cdot \mathrm{score}_{\mathrm{feedback}}(x) $$

All to no avail. At best, they only forced the model to try a little (keeping entropy higher for longer) before falling into the same failure mode. There are a few things we can still try here, like annealing the weight of $\mathrm{score}_{\mathrm{human}}$, or we can try to circumvent the detector model entirely.
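For what it’s worth, here is a sketch of what annealing the detector weight could look like, building on the weighted form above (illustrative only; I haven’t verified that this fixes the failure mode):

```python
# One possible annealing schedule for the detector term (hypothetical, not what I ran).
def annealed_weight(step: int, total_steps: int, w_max: float = 1.0) -> float:
    # Start with almost no detector pressure and ramp it up linearly over training,
    # so the model first learns to answer well before it learns to "sound human".
    return w_max * min(step / total_steps, 1.0)

def reward(score_human: float, score_feedback: float, step: int, total_steps: int) -> float:
    w = annealed_weight(step, total_steps)
    return w * score_human + (1 - w) * score_feedback
```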

2nd Idea: DPO Style RL Training

Direct Preference Optimization (DPO) is a method that optimizes the policy directly, without the need for a reward model. So, for our purpose of beating the reward model (the detector), this method is indirect. The idea is that, by preparing a dataset of accepted/rejected response pairs for every prompt, with the accepted response labeled as human-generated and the rejected one as machine-generated, we can use DPO to push the model’s outputs towards the accepted responses. This is not only computationally more efficient, but should also eliminate the failure mode of the previous method by removing the direct interaction with the detector.

DPO is an offline RL algorithm, in contrast to the usual online setup of RLHF, and it optimizes an implicit reward model (Rafailov et al., 2023).
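For reference, the objective DPO minimizes over such pairs (from Rafailov et al., 2023) is

$$ \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] $$

where $y_w$ is the accepted (human-labeled) response, $y_l$ the rejected one, $\pi_{\mathrm{ref}}$ a frozen copy of the starting model, and $\beta$ controls how far the policy may drift from that reference.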

The real challenge here is preparing a large enough dataset. Getting rejected responses is easy (you can use an LLM for that), but where do we get human-generated responses from? They also need to be labeled as such by the detector. I don’t have the money to pay humans to write thousands of texts, but thankfully we can potentially automate our way out of this predicament.

Generating the DPO dataset

Starting with LONG, our dataset from the PPO attempt, we can modify the responses flagged as machine-generated until they satisfy the detector, essentially doing a search in text space. This is easier said than done: text is very sensitive to small changes compared to other modalities like images or audio, and there’s no clear direction in which to change it. I’m aware of research on heuristic text obfuscation methods, such as Bevendorff et al. (2019), but those methods aren’t straightforward and aren’t as random as I would like. An interesting approach would be looking into discrete diffusion, which has been gaining popularity recently.

The approach I found the easiest was simply to ask the LLM to improve on its answer with some pointers in the prompt. Specifically, we can use the following system prompt:

you will be given a text to paraphrase.
Try to make it more random and natural while sticking to the following query: <QUERY>
This is not a conversation, return only the paraphrased text.

And then provide the text as a user prompt. The process can be repeated multiple times until the required quality is reached, in this case until the text is classified as human-generated by the detector. This is obviously very computationally expensive but surprisingly effective: the condition is often met after only one paraphrasing iteration. Better models can be used to generate this dataset; in my case I used both Mistral-Nemo-Instruct-2407 and GPT-3.5-Turbo. The dataset and related scripts are available on the huggingface page here.
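Sketched out, the loop looks roughly like this. Here `paraphrase` stands in for a chat-completion call that takes the system prompt above, and `p_machine` for the detector score; both are placeholders for whatever stack you use:

```python
# Rough shape of the "paraphrase until the detector is fooled" loop (helper names are placeholders).
def humanize(query: str, response: str, p_machine, paraphrase,
             threshold: float = 0.5, max_iters: int = 5) -> str | None:
    """Return a rewrite of `response` that the detector labels human, or None if we give up."""
    text = response
    for _ in range(max_iters):
        if p_machine(text) < threshold:  # detector already thinks it's human-written
            return text
        system = (
            "you will be given a text to paraphrase.\n"
            f"Try to make it more random and natural while sticking to the following query: {query}\n"
            "This is not a conversation, return only the paraphrased text."
        )
        text = paraphrase(system=system, user=text)  # one paraphrasing pass with the LLM
    return None  # couldn't satisfy the detector; drop this example
```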

Results

I fine-tuned both Llama-3.1-8B and Phi-3-mini on the LONG-DPO dataset. The fine-tuned models, let’s dub them natural, were able to beat the detector regularly on prompts the original models would almost always fail at. A blog post is not the place for detailed evaluations and tests, but in essence there’s no noticeable drop in the quality of the natural models. It’s important to note that I’m comparing how these models behave under default prompting, i.e. no system prompt and no fiddling with the temperature or top-k/top-p parameters.

If you are interested in playing with these models yourself, you can find them on my huggingface page. Moreover, if you want to test models on LLM detectors, I’ve made a simple tool for that here.

Bibliography

Bevendorff, Potthast, Hagen & Stein (2019). Heuristic Authorship Obfuscation. Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P19-1104

Guo, Zhang, Wang, Jiang, Nie, Ding, Yue & Wu (2023). How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. Retrieved from https://arxiv.org/abs/2301.07597

Hans, Schwarzschild, Cherepanova, Kazemi, Saha, Goldblum, Geiping & Goldstein (2024). Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text. Retrieved from https://arxiv.org/abs/2401.12070

Mitchell, Lee, Khazatsky, Manning & Finn (2023). DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. Retrieved from https://arxiv.org/abs/2301.11305

Ouyang, Wu, Jiang, Almeida, Wainwright, Mishkin, Zhang, Agarwal, Slama, Ray, Schulman, Hilton, Kelton, Miller, Simens, Askell, Welinder, Christiano, Leike & Lowe (2022). Training language models to follow instructions with human feedback. Retrieved from https://arxiv.org/abs/2203.02155

Perkins, Roe, Vu, Postma, Hickerson, McGaughran & Khuat (2024). GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education. Retrieved from https://arxiv.org/abs/2403.19148

Rafailov, Sharma, Mitchell, Ermon, Manning & Finn (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Retrieved from https://arxiv.org/abs/2305.18290

Verma, Fleisig, Tomlin & Klein (2023). Ghostbuster: Detecting Text Ghostwritten by Large Language Models. Retrieved from https://arxiv.org/abs/2305.15047