On Tuesday, March 14, OpenAI released its latest machine learning model, GPT-4. While it hasn’t immediately rocked the world in the same way ChatGPT did, that’s mostly because there wasn’t a shiny new interface to go along with it. Trust us — it’s still incredibly exciting.
Thing #1: Multimodality isn’t here yet
Pre-launch, a lot of the hype around GPT-4 was about its being multimodal, or able to accept both text and images as input. Currently, to upload images you need access to the developer API, which is obviously not for everyone. For everyone else, GPT-4 still only accepts text input.
The hype around multimodality is likely warranted. Expanding the input options to both text and images could (should?) exponentially improve the potential output of the AI, and could pave the way for video, audio, and other multimodal inputs and outputs in the future.
Thing #2: GPT-4 can accept much larger inputs
In the absence of multimodality, one of the most obvious ways GPT-4 differs from GPT-3.5 is that it can accept much larger inputs (and produce larger outputs, but that’s not going to be useful in the same way.)
The maximum number of tokens you can use at a time with GPT-3.5 is 4,096.With the base model of GPT-4, that max doubles to 8,192 tokens—and there’s even a second GPT-4 model that can handle up to 32,768 tokens.
What does that mean in practice?
For starters, it means I can give GPT-4 OpenAI’s entire technical report (minus the appendices) on GPT-4 to read. (That’s over 5,000 words of content.) I asked it to summarize the report and call out any important information that was missing.
Here was GPT-4’s response:
Prompt: Summarize the main points of this research paper. What important information is missing? (followed by the full text of OpenAI’s GPT-4 Technical Report)
With GPT-3.5 and earlier models, you couldn’t give it such a long input as an entire technical report. This is a really cool advancement, as you can now provide the model with a lot more information as context.
This capability is especially useful since the model isn’t hooked up to the internet. The only way for it to have new information is if you provide it — and you can now provide it a lot more.
For contrast, if I ask what GPT-4 is without providing the technical report, here’s what I get:
Prompt: What is GPT-4?
…and so on.
As far as GPT-4 knows, GPT-4 is still a hypothetical successor to GPT-3. Which makes sense, because obviously it couldn’t have been trained on text from a world in which GPT-4 already existed. In all the content the model has seen, GPT-4 is still a future development.
What this means, though, is that we can now get much better results from GPT-4 on things like new events or extremely in-depth topics, by providing it much more information in the prompt.
In addition to what this improvement enables, it’s also really interesting to consider from an architecture standpoint. In order to accept more tokens, the model has able to recall and synthesize information over a much larger window. Was this done simply by building a larger model with more layers and parameters, or were fundamental changes made to how it processes and stores information?
Unfortunately, the lack of any answer to that question brings us to our third point.
Thing #3: OpenAI isn’t quite so…open…anymore
One fascinating thing about GPT-4 has absolutely nothing to do with its abilities. From OpenAI’s research paper on it:
This report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.
No further details about the model size, dataset, training…anything?
That is wildly not open. It’s also a big departure from OpenAI’s public research on earlier GPTs.
It’s also worth noting how at odds those two reasons for secrecy are: the competitive landscape, and the safety implications of large-scale models. “Safety implications” require caution and prudence, but a “competitive landscape” requires full steam ahead to beat out anyone else.
Leaving users in the dark about dataset construction and training method means that we’ll struggle to identify potential biases in the AI output. After all, human beings made the decisions about those training models and datasets, and those humans have implicit biases. The training data then also has built in bias.
Eliminating that bias is messy, complex, and quickly descends into a rabbit hole of debate only enjoyed by philosophy majors and people who like commenting on local news articles. However, being aware of that bias is important for everyone using AI to create new content.
On a totally unrelated note, two other major AI advancements were released the same day as GPT-4: Anthropic’s Claude model and Google’s PaLM API. Since then, Anthropic has launched Claude 2 and Meta has thrown their hat in the ring with Llama 2. Claude 2 offers up to 100,000 tokens.
Clearly, this arms race is in full swing.
Thing #4: AI is becoming a star student (but still lies)
One of the most widely shared graphs from the launch shows GPT-4’s performance on various tests. It’s almost like OpenAI is still under the illusion, shared by high-achieving high schoolers everywhere, that standardized test scores in some way correlate to real-world success.
What is worth noting, however, is that GPT-4 was not specifically trained to take any of these tests. This isn’t the case of an AI model being specifically trained to play Go and eventually beating the best human player; rather, its ability to ace these tests represents a more “emergent” intelligence.
Previous models like GPT-3 also weren’t trained to take particular tests, but, as you can see, GPT-4’s performance has improved significantly over GPT-3’s:
These graphs look good and have become staples of articles and press announcements featuring new models. But ask yourself: do you really want an AP English student – even a particularly skilled one – in control of your marketing messaging and copywriting? Me neither.
If you don’t care about AI’s ability to take standardized tests and just want to know how well it’s going to do what you want, this is still good news. From the report:
GPT-4 substantially improves over previous models in the ability to follow user intent. On a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.
So, GPT-4 is more likely to give you what you’re looking for than GPT-3.5. That’s great. It’s important to keep in mind, though, that in spite of its improved performance, the new model still has all the same limitations we know and love from our existing AI friends.
Another quote from the report:
Despite its capabilities, GPT-4 has similar limitations to earlier GPT models: it is not fully reliable (e.g. can suffer from ‘hallucinations’), has a limited context window, and does not learn from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts where reliability is important.
In fact, hallucinations could become an even bigger problem than they were, simply because the better the AI gets, the easier it will be to believe what it says. With GPT-3 and GPT-3.5, people are well aware the model will totally make stuff up because it happens so frequently. As newer and better models do that less frequently, there’s a greater risk that when they do hallucinate, we may fail to notice or fact-check it.
So stay vigilant, friends. But also, these are very exciting times.