Why Most AI Benchmarks Offer Limited Information
On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. Just a few days later, rival Inflection AI unveiled a model that it asserts comes close to matching some of the most capable models out there, including OpenAI’s GPT-4, in quality.
Anthropic and Inflection are by no means the first AI firms to claim their models meet or beat the competition by some objective measure. Google made the same argument for its Gemini models at their release, and OpenAI said it of GPT-4 and its predecessors, GPT-3, GPT-2 and GPT-1. The list goes on.
But what metrics are they talking about? When a vendor says a model achieves state-of-the-art performance or quality, what does that mean, exactly? Perhaps more to the point: will a model that technically “performs” better than some other model actually feel improved in a tangible way?
On that last question, not likely.
The problem lies with the benchmarks AI companies use to measure a model’s strengths and limitations.
The benchmarks most commonly used today for AI models – particularly chatbot-powering models like OpenAI’s ChatGPT and Anthropic’s Claude – do a poor job of capturing how the average person actually interacts with them. For instance, one benchmark cited by Anthropic in its recent announcement, GPQA (“A Graduate-Level Google-Proof Q&A Benchmark”), contains hundreds of Ph.D.-level questions on biology, physics and chemistry. Yet most people use chatbots for tasks such as responding to emails, writing cover letters and talking through their feelings.
Jesse Dodge, an AI scientist at the Allen Institute for AI, a non-profit AI research institute, says the industry is facing an “evaluation crisis.”
“Benchmarks are often static and normally focus on assessing a single capability, like a model’s accuracy in a single field, or its competence at solving mathematical reasoning multiple-choice questions,” Dodge said in a TechCrunch interview. “Most benchmarks used for evaluation are over three years old, from when AI systems were primarily for research and had few actual users. Plus, individuals use generative AI in various creative ways.”
That’s not to say the most commonly used benchmarks are entirely useless, but their value is eroding as AI models are increasingly marketed as all-purpose tools. The tasks these benchmarks typically test – like solving elementary math problems or spotting anachronisms in sentences – bear little resemblance to what most users actually ask of the models.
According to David Widder, a postdoctoral researcher at Cornell studying AI and ethics, the issue stems from a shift in how AI systems are built and deployed. Older AI systems were designed to solve a specific problem in a specific context (like medical AI expert systems), which allowed for a more contextualized assessment of how well they performed in that setting.
With the rise of “general purpose” systems, however, performance is judged more broadly, typically across a battery of benchmarks spanning many fields. And that raises a further concern: whether these benchmarks actually measure what they claim to measure.
An analysis of HellaSwag, a test designed to evaluate commonsense reasoning in models, found that more than a third of the test questions contained typos and “nonsensical” writing. Elsewhere, MMLU (short for “Massive Multitask Language Understanding”), a benchmark that’s been pointed to by vendors including Google, OpenAI and Anthropic as evidence their models can reason through logic problems, asks questions that can be solved through rote memorization.
[Image: Test questions from the HellaSwag benchmark.]
“[Benchmarks like MMLU are] more about memorizing and associating two keywords together,” Widder said. “I can find [a relevant] article fairly quickly and answer the question, but that doesn’t mean I understand the causal mechanism, or could use an understanding of this causal mechanism to actually reason through and solve new and complex problems in unforeseen contexts. A model can’t either.”
So benchmarks are broken. But can they be fixed?
Dodge thinks the way forward is a combination of evaluation benchmarks and human evaluation: prompting a model with a genuine user query and then hiring a person to rate how good the response is.
Widder, on the other hand, is skeptical that current benchmarks, even with obvious errors like typos fixed, can be improved to the point of being informative for the vast majority of generative AI users. Instead, he thinks tests of models should focus on their downstream impacts, and on whether those impacts are seen as desirable by the people affected.
He suggests, “We should identify the particular contextual goals we want AI models to be utilized for and assess their success, or potential success, in such scenarios. Hopefully, this process also involves determining whether the usage of AI in these contexts is appropriate.”