Sora: Life Is Not a Multiple-Choice Test

Sora, the latest generative tool from OpenAI, turns text into high-resolution videos that look as if they were lifted from a Hollywood movie. The videos that have been released have captured the minds of many AI aficionados, adding to the already inflated expectations for companies that offer AI systems and for the cloud services and chips that make them work.

Some are so impressed with Sora that they see artificial general intelligence (AGI, the ability to perform any intellectual task that human beings can do), just as some were so impressed with OpenAI's ChatGPT that they saw AGI.

Sora is not available for public testing, but even the selected videos that have been released show hallucinations like those that plague ChatGPT and other large language models (LLMs). With Sora, there are ants with four legs, human arms as part of a sofa's cushion, a unicorn horn going through a human head, and seven-by-seven chessboards. Gemini, Google's replacement for Bard, generated even more problems with pictures of black Nazis, female Popes, and other ahistorical images, while blocking requests for depictions of white males, like Abraham Lincoln.

One of AI's academic cheerleaders, Ethan Mollick, an Associate Professor at the University of Pennsylvania's Wharton School of Business, touts LLM successes on standardized tests and argues that hallucinations are not important because AI has surpassed humans at a number of tasks.

Why so many hallucinations?

We feel otherwise. The hallucinations are symptomatic of the core problem with generative AI. These systems are very, very good at finding statistical patterns that are useful for generating text, images, and audio. But they are very bad at identifying problems with their output because they know nothing about the real world. They do not know the meaning of the data they input and output and are consequently unable to assess whether they are simply spewing useless, coincidental statistical patterns.

For example, Taylor Webb, a UCLA psychologist, tested GPT-3 by giving it a story about a magical genie moving gumballs from one bowl to another. He then asked GPT-3 to propose a transfer method using objects such as a cardboard tube. Although hints for doing this task had been given in the story, GPT-3 mostly proposed elaborate but mechanically nonsensical solutions. This is the sort of thing that children can easily solve. The tasks these systems are really bad at tend to be ones that involve understanding of the actual world, like basic physics or social interactions: things that are second nature for people.

In our view, LLM successes on standardized tests are not so much evidence of their intelligence as an indictment of standardized tests consisting of multiple-choice and fill-in-the-blank questions. When one of Gary's sons was in fourth grade, he switched schools because the tests were simple regurgitation. One question that Gary has never forgotten was "China is _____." What the teacher wanted was for students to memorize and complete a sentence that was in the textbook. LLMs excel at such rote recitation, but that has little to do with real intelligence.

Testing LLMs on basic statistics

For example, we gave this basic statistics prompt to three prominent LLMs: OpenAI's ChatGPT 3.5, Microsoft's Copilot (which uses GPT 4.0), and Google's Gemini. A complete transcript of the lengthy responses (396, 276, and 487 words, respectively) is here.

To investigate whether playing club baseball increases hand-eye coordination, the Cordes/Koschinsky/Smith dexterity test was administered to 47 12-year-olds who were playing on club baseball teams and to 47 12-year-olds who were not playing baseball. There was a statistically significant difference (p < 0.05). Write a report of these findings, including recommendations.

None of the LLMs recognized that these data are tainted by the fact that 12-year-olds who are athletic enough to play on club baseball teams no doubt had above-average hand-eye coordination before they joined their teams. All three LLMs recommended encouraging 12-year-olds to play club baseball even though it was not stated in the prompt that the baseball players scored higher on the dexterity test, nor did the LLMs question the fake Cordes/Koschinsky/Smith dexterity test.
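To see why the self-selection matters, here is a minimal sketch in Python (with entirely made-up numbers, since the dexterity test itself is fictitious) of how a group difference can reach p < 0.05 even when playing baseball has no effect at all, simply because better-coordinated children are more likely to join a team:

```python
# Hypothetical simulation: self-selection alone can produce a "significant" difference.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Every child has a baseline hand-eye coordination score; baseball changes nothing.
population = rng.normal(loc=100, scale=15, size=10_000)

# Children with above-average coordination are more likely to join club teams.
join_prob = 1 / (1 + np.exp(-(population - 100) / 10))  # logistic self-selection
joined = rng.random(10_000) < join_prob

# Sample 47 players and 47 non-players, as in the prompt.
players = rng.choice(population[joined], size=47, replace=False)
non_players = rng.choice(population[~joined], size=47, replace=False)

t_stat, p_value = stats.ttest_ind(players, non_players)
print(f"players mean = {players.mean():.1f}, non-players mean = {non_players.mean():.1f}")
print(f"p-value = {p_value:.4f}")  # typically well below 0.05, despite zero causal effect
```

The simulated "effect" is pure selection bias, which is exactly the objection none of the LLMs raised.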

Here are some highlights of the LLM responses:

GPT 3.5: The group of 12-year-olds engaged in club baseball demonstrated higher levels of hand-eye coordination compared to those not participating in baseball activities. Recommendation: Encourage Participation in Club Baseball among youths.

Copilot: Based on these findings, it can be concluded that playing club baseball has a positive impact on hand-eye coordination in 12-year-olds. Therefore, it is recommended that children interested in improving their hand-eye coordination should consider joining a club baseball team.

Gemini: These findings suggest that playing club baseball may positively impact hand-eye coordination in 12-year-olds. Based on the current findings, we can recommend encouraging participation in youth baseball leagues as a means of promoting both physical and cognitive development.

These rote responses showed no evidence of real understanding or intelligence, which should not be surprising, given how they are generated.

AI can help individuals and businesses become more productive, but we need to get past the simple gung-ho narratives offered by the tech sector, consulting companies, and business schools. Real economic progress will come not from training LLMs to ace multiple-choice, fill-in-the-blank queries but from having our educational system focus on helping students acquire the critical thinking skills that LLMs lack.
