How I tested 55 prompts to create better outlines

Header image created by DALL-E-3 using this prompt: “I NEED to test how the tool works with extremely simple prompts. DO NOT add any detail, just use it AS-IS: A robot writing an outline on a whiteboard in an office. Organize the outline into several sections labeled with roman numerals, and make sure that underlying subsections are indented properly in relation to their main headers.” There’s a lot to unpack in that prompt, as well as in the image—look out for a DALL-E deep dive coming soon to a Verblog near you.

Step into the prompt testing lab

If you’re using AI to create content and don’t think you need to bother testing prompts, skim my last article on why prompt testing is so important.

In this article, I’m going to share how I recently tested 55 prompt variations to change the way we generate outlines for customers of our human-crafted AI content.

My goal here is to help you think about your own testing process. You might have different goals or be using your prompts to do something other than generate outlines, but the general principles and framework are helpful regardless of your use case.

A quick glossary:

  • Prompt variation: If I test multiple different prompts aimed at the same goal, e.g. writing an outline, those are prompt variations. A given variation could be a single prompt, or it could include multiple prompts in a chain.
  • Input: I’m using “input” to refer to the specific variables used within a prompt. Creating prompts with these variables in place allows you to reuse the same prompt over and over.
  • Output: Output refers to the LLM’s response to a prompt. In ChatGPT, this is the response you see in the window. Via the OpenAI API, it’s the text in the response.choices[0].message.content field. When using a prompt chain, I’m using “output” to refer to the final output (i.e., the one with the content I actually want, rather than the model’s intermediate responses).
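To make those terms concrete, here’s a minimal sketch in Python. The template and example values are hypothetical, and I’m assuming the openai package’s chat completions interface:

from openai import OpenAI  # assuming the current openai Python package

client = OpenAI()

# Prompt variation: a reusable template aimed at one goal (here, outlining)
prompt_template = "write an outline for the topic: {topic}\nword length: {word_count}"

# Inputs: the variables that get swapped into the template for each article
inputs = {"topic": "X Best VPNs for Small Businesses", "word_count": 1500}

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt_template.format(**inputs)}],
)

# Output: the model's response text
outline = response.choices[0].message.content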

Two commandments of LLM testing

1. Define “good” as quantitatively as you can

Testing LLMs often starts with an ambiguous idea of “I want to see which prompt gets me a better output.” Some of that “better” may be subjective, and there’s no way around that. But coming up with at least a few quantitative measures will make it much easier to evaluate the outputs of the prompts you’re testing, even if it’s just knowing a general range you want that measure to fall within.

Example metrics I’ve used for different tests (a few of these are sketched in code after the list):

  • Word count: When generating introductions, for example, I wanted to keep them in a certain word count range.
  • Reading level: In order to target a certain reading level, I automated running the prompt outputs through a tool like Readability to compare the reading levels. (If I had read this article on GPT-4’s ability to evaluate readability first, I would have just used that model instead of a separate tool. Heads up, that article contains a ton of statistical concepts but is well worth skimming if you’re interested in readability at all.)
  • Number of times a keyword is used
  • Whether a prohibited word is used
  • Length relative to the original: For example, I was building a tool to remove some of the fluff from AI-generated content and rewrite it more concisely. I cared about how long the rewritten text was relative to the original because I didn’t want to pare it down too much, but I also wanted to be sure it wasn’t making it longer. Word count alone wouldn’t have told me what I needed to know—I needed to evaluate the output relative to the specific input.
  • Runtime: If someone will be waiting in real-time for the output, I don’t want to use a prompt chain that takes minutes to run.
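Most of these checks take only a few lines of Python. Here’s a rough sketch of some of them; the word-count range is illustrative, and I’m using the textstat package as a stand-in for a readability tool:

import textstat  # stand-in for a readability tool (pip install textstat)

def word_count_ok(output: str, low: int = 150, high: int = 250) -> bool:
    # Is the output within the target word-count range? (Range is illustrative.)
    return low <= len(output.split()) <= high

def reading_level(output: str) -> float:
    # Approximate U.S. grade level of the output.
    return textstat.flesch_kincaid_grade(output)

def keyword_count(output: str, keyword: str) -> int:
    # How many times is the target keyword used?
    return output.lower().count(keyword.lower())

def length_ratio(output: str, original: str) -> float:
    # Length of the rewrite relative to the original (1.0 = same length).
    return len(output.split()) / len(original.split())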

Most likely, you won’t be able to reduce all of your evaluation to quantitative metrics. At some point, you’re actually going to have to review the outputs and decide for yourself that “It was the best of times; it was the worst of times” is a stronger opening sentence than “It was an era characterized by both joy and sadness.” At the very least, though, having some metrics in place will allow you to eliminate certain outputs off the bat, reducing the number you need to manually review.

Using AI to evaluate outputs

Wondering if you can use AI to help you qualitatively evaluate outputs? Research suggests that GPT-4 can reach 80 percent agreement with human preferences—which, for the record, is the same level of agreement humans reach with each other. I’m wary of relying exclusively on this approach, though, because I did my own testing with it over the summer, and the results weren’t exactly confidence-inspiring.

How I tested: I presented a pair of options to GPT-4 and asked it to evaluate which one was a better example of a certain voice. I used a low temperature to reduce the variability, and ran the exact same prompt with the same choices twice—and then reran the prompt with the same two choices but reversed the order in which the choices were given.

In total, GPT-4 compared the same two choices four times. I did this for 274 different pairings, and the model only had unanimous agreement with itself (meaning it chose the same choice all four times, regardless of whether that choice was presented first or second) on 53 percent of those pairings.

pie chart showing how often GPT-4 agreed with itself across the 274 pairings
n = 274

That’s the pink pie slice above. The second most common outcome (the purple slice) was the model choosing each option in the pair twice, meaning its choice was entirely arbitrary.
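For reference, here’s roughly what that comparison harness looked like (a simplified sketch; the judging prompt, temperature, and function names are placeholders rather than exactly what I used):

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Which of the following two passages is a better example of {voice}? "
    "Answer with only the letter A or B.\n\n"
    "Passage A:\n{a}\n\nPassage B:\n{b}"
)

def judge(a: str, b: str, voice: str) -> str:
    # Ask GPT-4 to pick the better passage; returns "A" or "B".
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.2,  # low temperature to reduce variability
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(voice=voice, a=a, b=b)}],
    )
    return response.choices[0].message.content.strip().upper()[:1]

def is_unanimous(option_1: str, option_2: str, voice: str) -> bool:
    # Compare the same two choices four times: twice in each order.
    picks = []
    for _ in range(2):
        picks.append(option_1 if judge(option_1, option_2, voice) == "A" else option_2)
    for _ in range(2):
        picks.append(option_2 if judge(option_2, option_1, voice) == "A" else option_1)
    # Unanimous = the model picked the same underlying option all four times.
    return len(set(picks)) == 1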

It’s worth highlighting that these stats only measure GPT-4’s consistency when evaluating pairs and don’t even begin to address whether its choice was actually “right,” i.e., whether it would match a human evaluator’s preference. Precision and accuracy: You need both if you’re going to use AI as an evaluation tool.

All this isn’t to say that using AI to judge its own outputs is impossible. I could no doubt raise its level of consensus (precision) by improving the prompt that asks it to evaluate each pair, and providing examples of my own choices in that prompt would likely help align it more closely with a human’s preferences (accuracy). That takes even more time and testing, though.

Bottom line: If you outsource qualitative evaluation to AI without putting in a lot of time first to make sure its evaluations are A) consistent, and B) aligned with your preferences, your results won’t be very good.

2. Test on multiple inputs

If you’re using AI at scale to create content, you need to test your prompts on multiple inputs. Unless you’re using a very low temperature, LLMs will give you different outputs every time, even for the same input, and their performance will vary even more across different inputs.

Be sure, too, that your inputs represent the range of how you’ll be using that prompt. If I create content for several different industries, for example, I’m going to make sure the inputs I use for testing aren’t all from a single industry. Similarly, if I want to use the same prompt to generate outlines for articles ranging from 600 to 2000 words, I’m going to include a range of word counts in my inputs. Otherwise, I might end up with a prompt that generates great outlines for 2000-word articles, but not for 600-word articles.

For testing a prompt to create outlines, for example, I might use a spreadsheet of inputs that looks like this:

spreadsheet showing information for six different articles

Each row represents a different set of inputs. I would run the same prompt six times, each time replacing variables in the prompt like {topic} or {word_count} with the actual values from one of the rows.
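In code, filling the template from each row looks something like this (a small sketch assuming the inputs sheet has been exported to a CSV with topic and word_count columns; in my actual setup I read the rows straight from Google Sheets):

import csv

prompt_template = "write an outline for the topic: {topic}\nword length: {word_count}"

# Hypothetical export of the inputs sheet: one row of variables per article
with open("test_inputs.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Fill the same template once per set of inputs
prompts = [prompt_template.format(**row) for row in rows]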

My prompt testing process

With those principles in place, let’s take a look at how I tested 55 different prompts to generate outlines for our customers. I’ll cover what I was looking to improve, the tools and process I used to test the different prompts, the resulting metrics, and how I evaluated the winning prompt.

What I wanted to improve

I wanted to make a few specific improvements to the outlines that were being generated for our customers:

  • Shorter outlines: Our existing outlines often included too many sections, resulting in the final article being too long for the designated word count.
  • Reduced risk of hallucination: If the outline included sections like “Case Studies,” “Testimonials,” or “References,” AI would inevitably try to make up that information when writing the article, which meant extra work for our human writers. I wanted to improve our process to prevent the AI from including those sections at all.
  • Better outlines for the format: For example, if the customer’s topic is a listicle like “X Best VPNs,” the headings in the outline should each be a specific VPN rather than “VPN #1,” “VPN #2,” etc., and those sections should comprise the bulk of the article. I also wanted to make sure our outlines did a better job keeping the reader’s intent in mind and covering the information they’d expect to see when searching for the customer’s keyword.

A final consideration for the quality of our customer experience, though not for the quality of the content itself, was how long it takes for the outline to be generated. Because customers are in our app waiting in real-time for the outline to appear so they can review and edit it before finalizing their order, it matters whether they have to wait ten seconds or a minute.

We really want our customers to review and edit the outline so we can be confident we’re covering what they want. The longer they have to wait, the less likely they are to do that.

The process

Google Sheets and Google Colab are my best friends.

In one sheet, I came up with an initial list of prompt variations. In some cases, the difference between two prompts would be just a few words. In others, they would look totally different. Here’s an example:

Prompt variation #1
write an outline for the topic: {topic}
word length: {word_count}

As you can see, I started extremely simple to understand what the LLM would do with minimal direction. For other variations, I used more sophisticated prompting strategies:

Prompt variation #5
You will be writing an outline for a given topic. First, think through how the article should be structured, given the searcher intent for the keyword. Provide these thoughts inside <analysis></analysis> tags. Then, provide the outline itself inside <outline></outline> tags.
topic: {topic}
keyword: {keyword}
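One practical note: when a prompt asks for tagged sections like that, you need a small post-processing step to pull out just the outline from the output. A minimal sketch, assuming the model follows the tag instructions (which is worth verifying during testing):

import re

def extract_outline(output: str) -> str:
    # Grab the text between <outline></outline> tags; fall back to the
    # full response if the model ignored the tags.
    match = re.search(r"<outline>(.*?)</outline>", output, re.DOTALL)
    return match.group(1).strip() if match else output.strip()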

In a second sheet, I stored the brief information for 30 different real articles that had already been ordered and delivered to our customers, along with the outlines that were originally generated for them.

screenshot of Verblio's content order form
We’ve intentionally kept our content order form minimal and structured, but our prompts still need to account for a wide range of inputs.

The next step involved using OpenAI’s API. If you’re not comfortable writing code but have access to a low- or no-code tool like Make or Zapier, you could access OpenAI’s models that way instead. Either way, it’s far easier than copy/pasting prompts and outputs from a ChatGPT window, and it’s the only viable way to do real testing at scale.

Using a Python program in a Colab notebook, I sent a prompt to the model (mostly either GPT-4 or GPT-3.5-turbo). Each prompt was created by replacing the variables in one of the prompt variations from the first sheet with one of the 30 sets of inputs from the second sheet—and I repeated that until I had prompted the model with every combination of prompt variation and input set. The program then automatically saved the resulting outlines to a third sheet.

screenshot of python code in a Google Colab notebook
This is the main part of my code, where I’m grabbing article inputs and prompt variations from two different sheets, and running each set of inputs through each prompt variation.
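In essence, the loop in that screenshot boils down to something like this (a simplified sketch; my real code reads from and writes back to Google Sheets, and the client setup and model settings here are placeholders):

import time
from openai import OpenAI

client = OpenAI()

def run_test(prompt_variations: list[str], input_sets: list[dict]) -> list[dict]:
    # Run every set of inputs through every prompt variation, recording
    # the generated outline and the runtime for each combination.
    results = []
    for prompt_id, template in enumerate(prompt_variations, start=1):
        for inputs in input_sets:
            start = time.perf_counter()
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": template.format(**inputs)}],
            )
            results.append({
                "prompt_id": prompt_id,
                "topic": inputs.get("topic"),
                "outline": response.choices[0].message.content,
                "runtime_seconds": round(time.perf_counter() - start, 1),
            })
    return results  # in my setup, these rows get written to a results sheet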

For every new outline the model generated, I then evaluated the quantitative metrics I cared about, based on those improvements I identified above:

  • How much shorter was it than the outline we had previously generated for the customer using our existing prompt flow?
  • Did it include any sections we didn’t want to see, like case studies or references?
  • How long did it take to run?

I aggregated these metrics for each prompt variation and compared the overall results.
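Concretely, the per-outline checks and the aggregation looked roughly like this (a sketch using pandas; the column names and the unwanted-section list are illustrative):

import pandas as pd

UNWANTED_SECTIONS = ("case studies", "testimonials", "references")

def pct_shorter(new_outline: str, old_outline: str) -> float:
    # Percent reduction in word count vs. the outline we'd previously generated.
    old, new = len(old_outline.split()), len(new_outline.split())
    return round(100 * (old - new) / old, 1)

def has_unwanted_section(outline: str) -> bool:
    # Does the outline include a section we never want, like "Case Studies"?
    return any(section in outline.lower() for section in UNWANTED_SECTIONS)

def summarize(results: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per-outline metrics into one row per prompt variation.
    # Expects one row per (prompt variation, article) with columns:
    # prompt_id, pct_shorter, has_unwanted_section, runtime_seconds.
    summary = results.groupby("prompt_id").agg(
        median_reduction=("pct_shorter", "median"),
        min_reduction=("pct_shorter", "min"),
        max_reduction=("pct_shorter", "max"),
        median_runtime=("runtime_seconds", "median"),
        unwanted_section_count=("has_unwanted_section", "sum"),
    )
    # Spread between max and min reduction = how consistent that prompt was.
    summary["spread"] = summary["max_reduction"] - summary["min_reduction"]
    return summary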

I couldn’t rely only on numbers, so I also manually reviewed the outlines to see whether listicles were properly formatted, if they made sense, etc.

I then iterated on the best-performing prompt variations to see if I could further improve the results, and did the same process again. And again. And again, and again, and again.

The results

By the end, I had tested 55 different variations of prompts, models, and temperatures. The results for some of them are in the chart below.

spreadsheet showing the results of testing different prompt variations
n = 30

First callout: You can see the results getting better (more green, less red) as I iterated further. This is why testing matters. You can make very real improvements, across multiple dimensions, that will mean significant time savings when you’re running these prompts over hundreds of cases.

Columns B through E are all about how much shorter the new outline was than the one we had previously generated. Column F shows how long it took each prompt (or prompt chain, in some cases) to run, which is approximately how long our customer would have to wait in the app. Column G shows how many of the new outlines contained a section it shouldn’t have, like “Case Studies.”

Consistency matters

The main reason it’s so important to test your prompts on multiple inputs (30, in my case) is that the model will behave differently every time. This was very important for us when looking at how much shorter the new outline was than the old one.

The median reduction (column B) is self-explanatory, but if we’d looked only at that measure, we wouldn’t have learned anything about how consistent that prompt variation was across inputs. Looking also at the minimum reduction (column C) was important because this showed the worst-case scenario: Each prompt variation actually resulted in a longer outline than the original for at least one of the test articles. For prompt 41, that worst case meant getting an outline that was more than twice as long as the one we’d originally gotten with our current prompts. For prompt 55, on the other hand, that worst case was significantly better, with the new outline being only 10 percent longer than the original.

The maximum reduction (column D) isn’t color-coded because I wasn’t aiming for a particular percentage reduction, though the 84 percent reduction for prompt 43 is probably too high. What’s more important for understanding how consistently a prompt behaved is the spread between the minimum and maximum reductions: column E. The lower that number, the more consistent the outputs from that prompt were, which is what we want.

On runtime

Two main factors impacted the runtime (column F):

  1. the LLM being used
  2. the number of prompts, i.e., whether it was a single prompt or a prompt chain

The length of the prompt also impacts the runtime, but to a much lesser degree than those two factors.

The tradeoff is that you can often get qualitatively better results by using longer prompts or prompt chains, but they will then take longer to run. Different models also have different runtimes: in general, older, smaller models are faster, while newer ones like GPT-4 are slower, due to both their size and higher traffic.

The winning prompt

The prompt variation that ended up being the best overall, on both quantitative and qualitative measures, was number 54.

You can see the results for prompt 54 met my original goals:

  • It consistently resulted in shorter outlines (but not too short!) and had a relatively low spread between the minimum and maximum reduction (column E).
  • The median runtime of 15 seconds (column F) wasn’t the lowest, but it was still less than half the average runtime for the prompt we were currently using.
  • It never included a section we didn’t want to see in the outline (column G).
  • When I reviewed the outlines manually, they were what we wanted in terms of quality and format, etc.

I’ll go into the exact prompting strategies in more detail in my next article, but in a nutshell, here’s what made prompt 54 work so well:

  • Giving the model time to “think”
  • Providing examples of what I wanted
  • Using a prompt chain (rather than a single prompt) to improve its accuracy at meeting specific requirements—but doing this on an older model to keep the runtime relatively low (see the sketch below for what a chain looks like)
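To make “prompt chain” less abstract, here’s a generic, stripped-down sketch of the pattern. This is not prompt 54 (those details are coming in the next article); it just shows how one call’s output feeds the next:

from openai import OpenAI

client = OpenAI()

def chat(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def outline_chain(topic: str, keyword: str, word_count: int) -> str:
    # Step 1: give the model time to "think" about structure.
    analysis = chat(
        f"Think through how an article on '{topic}' should be structured, "
        f"given the searcher intent for the keyword '{keyword}'."
    )
    # Step 2: feed that analysis back in and ask for the outline, with constraints.
    return chat(
        "Using the analysis below, write an outline for the article. "
        f"Keep it short enough for a {word_count}-word article, and do not "
        "include sections like Case Studies, Testimonials, or References.\n\n"
        f"Analysis:\n{analysis}"
    )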

Could I have kept going with more prompt variations and seen more improvements? Sure. But at some point, I wanted to get a better prompt into production so our customers could start seeing the improvements sooner rather than later.

Moral of the story: Test a reasonable amount, but don’t let perfect become the enemy of good. As the gains from new variations become smaller, you’ll want to declare a winner and get on with your life.

In the next article, I’ll get into the content of the prompts themselves by sharing the specific prompt strategies I tested and tips for writing prompts that work at scale. If you have questions about the testing setup, the Python code I used, or anything else, send a message to megan@verblio.com.


Megan Skalbeck

Megan has been following the world of AI since the initial GPT release in 2018. As Head of AI Projects at Verblio, she's responsible for figuring out the best ways to blend the capabilities of artificial intelligence with the quality of our human freelance writers. When she's not doing tech things, she's making music, writing existentialist fiction, or getting reckless on two wheels.
