Last week OpenAI released GPT-4o. While others debated the ethics of mimicking the voice of Scarlett Johansson, I “kicked the tires” by conducting the same test I use to illustrate to students the limitations of ChatGPT 3.5:
Give a list of 10 composers born between 1680 and 1690.
The responses never fail to surprise and delight.
First, the figures students expect to encounter – namely, George Frideric Handel (1685) and Johann Sebastian Bach (1685) – generally appear lower down the list than a human might expect. Turns out, the names that come first to our minds aren’t the ones that come first to ChatGPT’s “mind.”
Second, ChatGPT 3.5 always provides 2–4 composers born either before 1680 or after 1690. I’ve entered this prompt dozens of times, and ChatGPT 3.5 has never scored 100%. Antonio Vivaldi (1678) is a common response. Jean-Féry Rebel (1666) and Giovanni Battista Pergolesi (1710) appear frequently, too.
To my students’ amusement and surprise, the technology neglects to conduct even the most rudimentary check of its responses. Instead, fine print accompanies each response – in a smaller font, of a lighter color, at the bottom of the page – “ChatGPT can make mistakes. Check important info.”
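For the curious: that rudimentary check is trivial to implement. Here is a minimal sketch in Python, assuming a response has already been parsed into (composer, birth year) pairs; the sample data below is illustrative, not an actual transcript of any one session:

```python
# A minimal sketch of the check ChatGPT skips: given (composer, birth_year)
# pairs, verify each year falls in the requested range and report accuracy.
# The sample data is illustrative, not a verbatim ChatGPT response.

def score_response(composers, start=1680, end=1690):
    """Return the fraction of composers born within [start, end]."""
    hits = [name for name, year in composers if start <= year <= end]
    return len(hits) / len(composers)

response = [
    ("George Frideric Handel", 1685),
    ("Johann Sebastian Bach", 1685),
    ("Antonio Vivaldi", 1678),              # out of range
    ("Jean-Féry Rebel", 1666),              # out of range
    ("Giovanni Battista Pergolesi", 1710),  # out of range
]

print(f"Accuracy: {score_response(response):.0%}")  # Accuracy: 40%
```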
That’s especially useful information, I suspect, for those who use ChatGPT to diagnose medical conditions and draft legal briefs.
Enter GPT-4o. The new and improved version touts “improved reasoning” that “pushes the boundaries of deep learning.” But how would it fare in my simple road test?
Accuracy: 40%. No Handel, no Bach … and an odd fascination with Jean-Féry Rebel.