You’ve Got MLAI: When GPT-3 Goes Off the Metaphorical Rails

GPT-3 is a machine learning model from OpenAI. It generates human-like text based on an enormous dataset of existing content from the internet. When people talk about AI-generated content, they’re usually talking about GPT-3 or tools built on top of it.

GPT-3 is fun to play with and can do some extremely impressive things. It can also get things comically wrong.

Prompt: Write a fun article comparing Zoom and other video conferencing programs to different ice cream flavors.

Output:
When it comes to choosing the right video conferencing program, there are many factors to consider. Price, features, ease of use – and of course, flavor. Just like with ice cream, there are a wide variety of video conferencing programs available, each with their own unique flavor. Here’s a look at how some of the most popular video conferencing programs compare to your favorite ice cream flavors. 

Zoom: Chocolate 

Zoom is rich, chocolatey, and perfect for when you need a quick pick-me-up. It’s also one of the most popular video conferencing programs on the market. Zoom is known for being easy to use and offering a wide range of features at a reasonable price. 

Skype: Vanilla 

Skype has been around for quite awhile and is still going strong. It’s reliable and perfect for basic video chat needs. Skype is also available on almost every platform, making it easy to use no matter what device you have. And like vanilla ice cream, Skype is a classic that will never go out of style. 

Other competitors: Stracciatella & Rocky Road 

When it comes to features and price Stracciatella offers more than Skype but less than Zoom whereas Rocky Road offers fewer features than both Stracciatella & Skype but costs more than both as well

Wait, what?

(Also, in case you’re not a diehard dessert fan, stracciatella is basically the same as chocolate chip. I’ve only ever seen it as a flavor of gelato, not ice cream, but that’s the least of our problems here.)

This is a perfect example of some of the strangeness you can run into with GPT-3. 

It does impressively well at a lot of natural language tasks, like making a list of sci-fi books, writing a paragraph about air conditioners, or even classifying tweet sentiments as positive or negative. I was curious to see what it could do with something more abstract, so I asked it to compare video conferencing platforms to ice cream flavors.

To the credit of machines everywhere, GPT-3’s response starts off strong and seems to ‘understand’ the metaphor I want to build. True, the second sentence in the intro—“Price, features, ease of use – and of course, flavor”—makes me wonder how literally it’s taking the comparison, but the Skype-vanilla section is dead-on.

In the final section, though, it goes off the rails, and I’m reminded of a very important fact:

GPT-3 doesn’t actually ‘understand’ anything and doesn’t know the difference between a frozen dessert and a video call, except insofar as those two concepts are usually surrounded by different words.

When working with AI and machine learning, you forget this at your peril.

Hi, I’m Megan

I work on the marketing team at Verblio. Before that, I was a freelance writer, including for a lot of Verblio customers. On weekends I ride a bike or motorcycle, and sometimes I write literary fiction. It’s unclear if the world will ever see it. 

I’m exploring AI and ML (machine learning) for the company because, well, it’s important to our future to understand it. We’re not using AI or ML to write content yet. (We’re not opposed, but as you’ll see in this series of articles, the tech isn’t ready for the goals of content marketers.) In the future, we might. We believe the future of content creation may involve some cyborg-like collaboration, and that’s exciting.

In this series, I’ll be sharing some of my experiments, musings, and ideas.

Human creativity vs. machine creativity

Related to the above, there’s a pun to be made regarding ice cream and a frozen video call. I asked GPT-3 for one, and here’s what it came up with:

Prompt: Write a joke connecting frozen treats with a frozen Zoom call.

Output:
Why did the ice cream sundae join the frozen Zoom call?

Because it wanted to be a part of the cool crowd!

Oof. That’s almost funny? Given that “frozen” is the connection between the two subjects, though, I was expecting that to be in the punchline, like “What did the ice cream and the glitchy Zoom call have in common?” 

“They were both frozen!”

(Okay, maybe that’s not much better, but at least we’re squarely in dad-joke territory now.) 

Deeper into dad joke backcountry

To be fair, I could probably get a better joke with more guidance in the prompt. What would be most valuable, though, is if GPT-3 could generate that joke itself, given the original article prompt. 

This leads us to a hierarchy of creative capabilities for machine learning and metaphors: 

  1. Ability to generate text output that kinda looks like what I want but isn’t (This is where we currently are with GPT-3.)
  2. Ability to generate text output that is what I want
  3. Ability to generate text output that is what I want and includes a metaphor on its own, without specific direction to do so
     • Example prompt: “Write a fun article comparing different video conferencing platforms.”
     • Output: an article comparing those platforms to ice cream flavors
  4. Ability to generate text output that is what I want, includes a metaphor on its own, and makes a good joke about it
     • Example prompt: “Write a fun article comparing different video conferencing platforms.”
     • Output: an article comparing those platforms to ice cream flavors that includes the obvious joke comparing frozen desserts to frozen video screens

For a human, this hierarchy is straightforward. Level One is useless, and any writer worth a fraction of their salt can successfully do Level Two: Create an article comparing video conferencing platforms to ice cream flavors, given a brief that asks for “an article comparing video conferencing platforms to ice cream flavors.”

So far, so good.

From there, it’s a small step up for a human to reach Level Three: coming up with the ice cream comparison themselves, as a way of illustrating the differences between various platforms. (At least, it’s a small step up in that the pool of humans who can achieve Level Three is not that much smaller than the pool who can achieve Level Two. How a comparison like that gets generated in the brain is no doubt a very complex thing—but from the outside, it’s one of those things we can do seemingly effortlessly.)

Adding the joke in Level Four is the cherry on top. For a human, this is another of those weird creative magic things that just happen—I didn’t sit down with a plan to figure out where I could add humor, nor did I make any conscious decision that I needed to write a relevant joke. The connection between the two topics simply sprang to mind, I recognized the potential joke, and I included it.

I wasn’t in “writing article mode” and then switched to “writing joke mode.” The two seem to be intertwined and simultaneous, inasmuch as they felt like separate processes at all.

Why originality is hard for machines

GPT-3, on the other hand, hasn’t yet achieved Level Two: following directions to use a metaphor correctly. Level Three—generating a metaphor on its own, without specific guidance—will require still more significant progress beyond that.

Why is it so much harder for a machine when humans can do this so easily?

Because GPT-3 operates on pattern recognition.

If I only ask GPT-3 to compare video conferencing platforms, it can do that relatively well. That type of content exists in a lot of places on the internet and looks pretty similar in most of those places, meaning there are strong patterns for it to identify and follow. 

When I add in the comparison to ice cream flavors, though, I’ve significantly reduced the amount of similar content on the web that GPT-3 can look to for examples. There are a lot fewer articles talking about both ice cream and video conferencing than there are articles talking about just one of those topics. This makes it that much harder for it to follow directions successfully for Level Two, but it also means the odds of it generating that comparison framework on its own (as in Level Three) are extremely low.

The catch-22 is that if there were a lot of content online comparing video conferencing platforms to ice cream flavors, it would stop being anything original or impressive for GPT-3 to write. Without that content, though, there’s no pattern for it to match. That’s why originality is hard for any machine learning model.

Does this example represent a true “wall” for AI, one that requires some dramatic shift in its structure, or will it be solved through normal progress? Great question—welcome to the debate.

Humor is human

Finally, let’s look at Level Four: creating a metaphor and adding in the relevant joke. Despite being a relatively minor thing for a human, this seems like it would require exponentially more GPT-3 power, even after it reaches Level Three.

It can write articles and generate decent jokes independently of each other. From an algorithm perspective, though, how would it even go about including a joke, unprompted, in an article? 

A human can realize there’s a funny aside to be made while writing an article and make that slight diversion from the core narrative.

GPT-3, however, not only has to master the art of humor—it has to master the art of recognizing when there is an opportunity for humor.

Until it has a more elegant way to do that, it would likely have to run at least two processes at the same time, similar to the “writing article mode” and “writing joke mode” I mentioned above. In this case, it might look like Process 1 writing the article, Process 2 creating a “joke” around the latest text from Process 1, and some additional tool like OpenAI’s “best of” parameter layered on top to choose the strongest of all the jokes it’s generating and add that one to the final text.
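Just to make the idea concrete, here’s a minimal sketch in Python of that generate-many-candidates-then-pick-one loop. To be clear: every function here is hypothetical. The canned joke list stands in for a real text-generation model, and the scoring function is a crude stand-in for something like OpenAI’s “best of” selection (which ranks candidates by the model’s own probabilities, not by keyword matching).

```python
import random

def generate_joke_candidate(context: str) -> str:
    # Stand-in for Process 2: a call to a text-generation model,
    # prompted with the latest article text from Process 1.
    # Here we just pick from canned candidates.
    candidates = [
        "Why did the ice cream sundae join the frozen Zoom call? "
        "It wanted to be part of the cool crowd!",
        "What did the ice cream and the glitchy Zoom call have in common? "
        "They were both frozen!",
        "I'd tell you a Zoom joke, but you'd probably freeze up.",
    ]
    return random.choice(candidates)

def score_joke(joke: str, context: str) -> float:
    # Crude stand-in for a "best of"-style ranking. This version just
    # rewards candidates that actually use the word "frozen"--the
    # connection between the two topics.
    return 1.0 if "frozen" in joke.lower() else 0.0

def best_joke(context: str, n: int = 5) -> str:
    # Generate several candidates, then keep the highest-scoring one.
    candidates = [generate_joke_candidate(context) for _ in range(n)]
    return max(candidates, key=lambda joke: score_joke(joke, context))
```

The hard part, of course, isn’t the loop—it’s that real versions of `generate_joke_candidate` and `score_joke` would each need a model that understands humor, which is exactly the thing we don’t have yet.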

Yeah, it’s a bit complicated. That’s the point.

This would take exponentially more processing power, and, perhaps more importantly, I would still need to figure out how to define “joke.” Given specific parameters, GPT-3 can generate certain types of jokes—like the example of the New Yorker cartoon captions—but it needs to be given a specific prompt. And until I know what the specific joke is going to be, how do I know whether I should tell it to write a pun or a knock-knock joke? If I know the joke in advance, though, I’ve defeated the purpose of having GPT-3 create it for me.

It might seem like a relatively unimportant thing for content to contain humor, but it’s one of the things that can delight us as an audience—and that we take for granted—when reading all but the most practical of articles.

If I’m trying to learn whether my symptoms are those of a heart attack or not, then no, I don’t want any unrelated asides or humor to get in the way of the information I need. For most other types of reading, however, whether that’s debating which project management software to buy or learning the history of cattle ranching in America, a well-placed quip only enhances the experience and reminds us that we’re humans, not pure information-seeking automatons.

If we want GPT-3 to produce that kind of spontaneous humor, though, we’ll be waiting until the cows come home.

What’s next?

I’m going to keep experimenting with this stuff. Next up is probably a digression on randomness, GPT-3’s temperature parameter, and the nature of creativity. Send an email to megan@verblio.com if you want to chat.

Megan Skalbeck

Megan traffics in words. Whether that’s spinning up a story on the blog or paring down a conversation on the podcast, it’s all elementary math in the end: She adds, subtracts, multiplies for effect, and divides for readability. When she’s not helping words live their most meaningful life, she’s usually in the woods, in the ocean, on a rock, or on the road.
