The Challenge of Reviewing AIs and Why TechCrunch Is Taking It On Anyway
Every week seems to bring with it a new AI model, and the technology has unfortunately outpaced anyone’s ability to evaluate it comprehensively. Here’s why it’s pretty much impossible to review something like ChatGPT or Gemini, why it’s important to try anyway, and our (constantly evolving) approach to doing so.
The tl;dr: These systems are too general and are updated too frequently for evaluation frameworks to stay relevant, and synthetic benchmarks provide only an abstract view of certain well-defined capabilities. Companies like Google and OpenAI are counting on this because it means consumers have no source of truth other than those companies’ own claims. So even though our own reviews will necessarily be limited and inconsistent, a qualitative analysis of these systems has intrinsic value simply as a real-world counterweight to industry hype.
Let’s first look at why it’s impossible, and then get into our methodology.
The pace of release for AI models is far, far too fast for anyone but a dedicated outfit to do any kind of serious assessment of their merits and shortcomings. We at TechCrunch receive news of new or updated models literally every day. We see these and note their characteristics, but there’s only so much inbound information one team can handle, and that’s before you even get into the rat’s nest of release levels, access requirements, platforms, notebooks, code bases, and so on. It’s like trying to boil the ocean.
Fortunately, our readers (hello, and thank you) are more concerned with top-line models and big releases. While Vicuna-13B is certainly interesting to researchers and developers, almost no one uses it for everyday purposes the way they use ChatGPT or Gemini. And that’s no knock on Vicuna (or Alpaca, or any of the others); these are research models, so we can set them aside for now. But even after cutting 9 out of 10 models for lack of reach, we still have more than we can handle.
That’s because these large models aren’t simply pieces of software or hardware that you can test, score, and be done with, the way you might compare two gadgets or cloud services. They aren’t just models; they’re platforms, with dozens of distinct models and services built into or bolted onto them.
For instance, when you ask Gemini how to get to a good Thai restaurant near you, it doesn’t just consult its training data and answer; after all, the chance that some document it ingested explicitly describes those directions is practically nil. Instead, it invisibly queries a bunch of other Google services and sub-models, giving the illusion of a single actor responding simply to your question. The chat interface is just a new frontend for a huge and constantly shifting variety of services, both AI-powered and not.
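To make that plumbing a bit more concrete, here’s a minimal, purely hypothetical sketch of what such a routing layer can look like. Every function and service name below is our own invention for illustration; it is not how Google actually wires Gemini together.

```python
# A deliberately simplified, hypothetical sketch of a chat frontend that
# quietly routes one question through several backend services. Every
# function here is an invented stand-in, not a real API.

def classify_intent(query: str) -> str:
    # Stand-in for a small classifier sub-model.
    return "local_search" if "restaurant" in query.lower() else "general"

def places_service(query: str) -> list:
    # Stand-in for a places/maps backend.
    return [{"name": "Thai Basil", "address": "123 Main St"}]

def directions_service(place: dict) -> str:
    # Stand-in for a directions backend.
    return f"Head north two blocks to {place['address']}."

def language_model(prompt: str, context: str) -> str:
    # Stand-in for the text model that writes the final reply.
    return f"{context} (written in answer to: {prompt!r})"

def answer(user_query: str) -> str:
    """One user-facing call that fans out to several services behind the scenes."""
    if classify_intent(user_query) == "local_search":
        place = places_service(user_query)[0]
        context = f"{place['name']}: {directions_service(place)}"
    else:
        context = ""
    return language_model(prompt=user_query, context=context)

print(answer("Where's a good Thai restaurant near me?"))
```

The point is simply that “the model” you’re chatting with is often a dispatcher sitting in front of many other things.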
So the Gemini, or ChatGPT, or Claude we test today may not be the same one you use tomorrow, or even at the same time! And because these companies are secretive, untrustworthy, or both, we don’t really know when or how those changes happen. A review noting that Gemini Pro fails at a given task may be rendered moot when Google quietly patches a sub-model the next day, or slips in hidden tuning instructions, so that it now pulls the task off.
Now imagine that, but instead of task X, it’s tasks X through X+100,000. As platforms, these AI systems can be asked to do just about anything, whether or not their creators planned for it, or whether the models were even built for it in the first place. That makes testing them comprehensively impossible: even with a million people using them daily, they never reach the end of what the systems can, or can’t, do. Their developers find this out all the time, as unexpected capabilities and unwanted edge cases keep turning up.
Moreover, these companies treat their internal training methods and datasets as closely guarded trade secrets. Mission-critical processes tend to thrive when they can be audited and inspected by disinterested experts. But we still don’t know whether, for instance, OpenAI used thousands of unlawfully obtained books to give ChatGPT its excellent prose skills. We don’t know why Google’s image model diversified a group of 18th-century slave owners (well, we have some idea, but not exactly why). The companies will make vague, apologetic statements, but they’ll never really let us peek behind the curtain, because there’s nothing in it for them.
Does all this mean it’s impossible to evaluate these models at all? No, of course it’s possible, but it’s far from straightforward.
Imagine an AI model as a baseball player. Plenty of players can also cook, sing, climb mountains, maybe even code, but what matters is whether they can hit, field, and run, because those are the core of the game and, to some extent, measurable.
The same goes for AI models. They can do countless things, but only a handful of those things really matter, the ones millions of people rely on every day. To measure those, we have a pile of “synthetic benchmarks,” as they’re usually called, that test how well a model answers trivia questions, solves coding problems, navigates logic puzzles, spots errors in prose, or catches bias and toxicity.
An example of benchmark results from Anthropic.
These tests typically produce a report, usually a number or a short string of numbers, indicating how the model did against its peers. Having them is useful, but their usefulness is limited. AI makers have learned to “teach to the test” (tech imitates life), tuning their models to these metrics so they can tout high scores in their press releases. And because the testing is often done privately, companies are free to publish only the results of tests their model did well on. So benchmarks aren’t sufficient for evaluating an AI model, but they aren’t worthless either.
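To give a sense of how much a single benchmark number compresses away, here’s a toy scorer of our own devising (not any real benchmark): it marks each answer right or wrong and averages, which is roughly the shape of many leaderboard figures.

```python
# A toy illustration (not any real benchmark) of how a synthetic score gets
# reduced to a single number: each answer is marked right or wrong and the
# results are averaged, discarding tone, reasoning, and near-misses.

trivia_items = [
    {"question": "In what year did Apollo 11 land on the Moon?", "answer": "1969"},
    {"question": "What is the chemical symbol for gold?", "answer": "Au"},
]

def fake_model(question: str) -> str:
    # Stand-in for a real model call; gets one item right and one wrong.
    return "1969" if "Apollo" in question else "Ag"

def benchmark_accuracy(model, items) -> float:
    correct = sum(model(item["question"]).strip() == item["answer"] for item in items)
    return correct / len(items)

print(f"Reported score: {benchmark_accuracy(fake_model, trivia_items):.0%}")
```

A “50%” from something like this tells you nothing about tone, reasoning, or how close the wrong answer was.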
And could any benchmark have predicted the “historical inaccuracies” of Gemini’s image generator, which hilariously served up a diverse cast of founding fathers (a group known to have been mostly rich, white, and racist), output now being held up as proof of an overly politically correct mindset in AI? How do you measure the “naturalness” of writing or emotional language without asking humans what they think?
These quirks and intangibles, or “emergent qualities” as the companies like to call them, matter once they’re discovered, but until then they’re unknowable.
To return to the baseball analogy: it’s as if the sport added a new event every game. Hitters you could count on are suddenly falling behind because they can’t dance, so now you need a good dancer even if they can’t field, and next a pinch contract evaluator who can also play third base.
What AIs claim to do, what they’re actually used for, who uses them, what can be tested, and who does the testing: all of these are in constant flux. We can’t emphasize enough how chaotic this field is. The game has gone from baseball to Calvinball, but it still needs an umpire.
Facing a daily deluge of AI PR nonsense makes it easy to get cynical, and easy to forget that there are real people out there trying to do cool or mundane things, people who are being told by the biggest, richest companies in the world that AI can help them do those things. The plain fact is that these companies can’t be trusted. Like any big business, they’re selling a product, or packaging you up to be one, and they’ll do and say whatever it takes to obscure that fact.
At the risk of overstating our modest virtues, our team’s main motivations are to tell the truth and pay the bills, in the hope that the one leads to the other. None of us has invested in these (or any) companies, the CEOs aren’t our personal friends, and we’re generally skeptical of their claims and resistant to their wiles (and occasional threats). We regularly find ourselves directly at odds with their goals and methods.
As tech journalists, we’re naturally curious whether these companies’ claims hold up. Our ability to evaluate them is limited, but we test the major models ourselves because we want that firsthand experience. Our testing isn’t a battery of automated benchmarks so much as a kind of test drive: we poke at the models the way an ordinary person would, then give a subjective account of how each one performs.
For instance, if we ask three models the same question about current events, the result isn’t simply pass or fail, or one scoring 75 and another 77. Their answers can be better or worse in ways users actually care about, not just more or less accurate. Is one more confident, or better organized? Is one overly formal or casual about the topic? Does one cite and integrate primary sources better? Which would we pick if we were a researcher, an expert, or a casual user?
Such qualities are hard to quantify, yet they’d be obvious to any human observer. The trouble is that not everyone has the opportunity, the time, or the motivation to spell out those differences. We generally have at least two of the three!
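If it helps to picture that habit, here’s a very loose sketch, with invented model names and a stub in place of any real API, of what asking the same prompt across several models and jotting down impressions amounts to.

```python
# A loose, hypothetical sketch of the "same prompt, several models" habit
# described above. The model names and ask() stub are placeholders; in
# practice each model sits behind its own app or API.

PROMPT = "Summarize today's most important business news story."

def ask(model_name: str, prompt: str) -> str:
    # Stand-in for actually querying each chatbot.
    return f"[{model_name}'s answer to: {prompt}]"

notebook = {}
for model in ("Model A", "Model B", "Model C"):
    notebook[model] = {
        "reply": ask(model, PROMPT),
        # Free-form impressions rather than a score: confidence, structure,
        # tone, sourcing, and who the answer would suit best.
        "impressions": "",
    }

for model, entry in notebook.items():
    print(f"{model}: {entry['reply']}")
```

What gets recorded is the prompt, the replies, and free-form notes, not a score.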
A handful of questions obviously isn’t a comprehensive review, and we try to be upfront about that. But as we’ve established, a truly “comprehensive” review of these things is practically impossible, and benchmark numbers don’t tell the average user much. So what we’re going for is something more than a vibe check but less than a full-scale review, and even then we want to systematize it a bit so we aren’t just winging it every time.
Our testing methodology is meant to give a general sense of an AI’s capabilities without drilling into the specifics, which are unreliable anyway. We use a set of prompts that we update regularly but that stay broadly consistent; you can see them in any of our reviews. Rather than re-explain them every time, we lay out the categories and our reasoning for them here.
It’s worth noting that these prompts are broad lines of questioning, allowing the testers to phrase them in whatever manner feels natural to them and to provide follow-up queries as they see fit.
After asking the model a bunch of questions and follow-ups, reviewing what other users have experienced, and weighing it all against the company’s claims, we put the review together, summarizing our experience of the model and what it did well, did poorly, did strangely, or didn’t do at all. Here’s an example of that process in action: Kyle’s recent test of Claude Opus, Anthropic’s new chatbot, which left us somewhat underwhelmed.
It’s just our experience, and it’s just for the things we tried, but at least you know what someone actually asked and what the models actually did, not just “74.” Combined with the benchmarks and some other evaluations, you can get a decent idea of how a model stacks up.
We should also talk about what we don’t do:
There you have it. We’re tweaking this rubric pretty much every time we review something, in response to feedback, model behavior, conversations with experts, and so on. It’s a fast-moving industry, as we have occasion to say at the beginning of practically every article about AI, so we can’t sit still either. We’ll keep this article up to date with our approach.