The race to AGI isn’t just about creating smarter models; it is also about building systems that think, adapt, and feel human, especially in the age of intelligent design. You ask a model a question and sense either a subtle resonance or a disconnect.
These instinctive “vibe checks,” as OpenAI’s Greg Brockman calls them, are no longer just hunches. Wharton professor Ethan Mollick acknowledges they’re becoming official benchmarks, hinting that a more intuitive approach to AI evaluation is taking shape. In essence, vibe checks are a method for identifying and validating qualitative differences between models.
A recent viral trend on X saw users ask ChatGPT to generate responses using all the information it had about them, drawing on ChatGPT’s memory feature. The responses spread quickly across the internet, and users delighted in sharing them. In another post, Mollick compared the outputs of different AI models.
Are Vibe Evals Good Benchmarks?
A recent study by researchers at UC Berkeley identifies and quantifies qualitative differences, or “vibes,” in the outputs of large language models (LLMs). Traditional evaluation methods focus on predefined axes such as accuracy, clarity, and conciseness, which fail to capture open-ended, subjective user preferences.
As models achieve high baseline performance, users often rely on “vibes” to choose between them.
“Just take the landscape today: Claude, GPT, and Gemini are so good that people often go off of ‘vibes’ because (i) all the models are able to do the tasks they want so they are looking for an explanation which is best suited to them or (ii) they are asking really open-ended questions like writing which can’t currently be quantified,” said Lisa Dunlap, a co-author of the paper, in an exclusive interview with AIM.
Her research focuses on qualitative differences, emphasising tone and style rather than traditional measures like accuracy. Dunlap and her collaborators use LLM judges to identify and quantify the “vibes” that set models apart.
These vibes are evaluated on three criteria: consistency, the ability to distinguish between models, and alignment with user preferences. For this, the researchers introduce VibeCheck, a framework for evaluating AI models (e.g., Llama-3-70b, GPT-4). Dunlap noted that the choice of models was driven primarily by cost.
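For readers who want a concrete picture, here is a minimal sketch of how an LLM judge might score one “vibe” axis across paired outputs from two models. The vibe axis, judge prompt, choice of judge model, and the crude separability score are illustrative assumptions for this article, not the exact implementation from the Berkeley paper.

```python
# A minimal sketch of LLM-judge "vibe" scoring; prompt wording and scoring
# are illustrative assumptions, not the paper's exact method.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_vibe(prompt: str, output_a: str, output_b: str, vibe: str) -> str:
    """Ask a judge model which output expresses a given vibe more strongly."""
    judge_prompt = (
        f"Vibe axis: {vibe}\n"
        f"User prompt: {prompt}\n\n"
        f"Output A:\n{output_a}\n\n"
        f"Output B:\n{output_b}\n\n"
        "Which output expresses the vibe more strongly? Answer A, B, or TIE."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model could stand in here
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()


# Toy data: in practice these would be real outputs from the two models
# being compared, collected over many prompts.
pairs = [
    ("Explain recursion.",
     "Recursion is when a function calls itself until a base case stops it.",
     "Think of recursion like nesting dolls: each one opens to a smaller one!"),
]

verdicts = [
    judge_vibe(p, a, b, vibe="friendly, conversational tone")
    for p, a, b in pairs
]
a_wins = sum(v.startswith("A") for v in verdicts)
b_wins = sum(v.startswith("B") for v in verdicts)

# A useful vibe should separate the two models consistently; this is a
# crude stand-in for that idea.
separability = abs(a_wins - b_wins) / len(verdicts)
print(f"Separation along this vibe axis: {separability:.2f}")
```

In the same spirit, the verdicts can be checked for judge self-consistency across repeated runs and correlated with human preference data to test whether the vibe actually matters to users.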
“When generative models started to become something that I used outside of research just for my day-to-day tasks, it hit me how narrow the current evals are,” she added, emphasising the need for more subjective evals.
The Vibes of AI Models Are Complex, Just Like Humans
On the whole, vibes-based evaluation presents challenges around subjectivity and scalability. Because “vibes” are subjective, automating or standardising them is difficult and resource-intensive, especially for large-scale or real-time applications.
“I would say the biggest challenge is making vibes that are well-defined,” said Dunlap.
For instance, humour is subjective and varies by person and culture, which makes a trait like this hard to measure reliably. Using multiple language models to cross-check results helps, but the judges’ biases can still diverge from what most people would say.
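As a sketch of that cross-checking step, one could pool verdicts from several judge models and treat their agreement as a rough signal of how well-defined a trait is. The judge functions, labels, and agreement score below are hypothetical stand-ins, not any particular lab’s pipeline.

```python
from collections import Counter
from typing import Callable

# Hypothetical judges: each wraps a different LLM and labels a piece of text
# as "FUNNY" or "NOT_FUNNY". Any API-backed callables could be dropped in.
Judge = Callable[[str], str]


def cross_check(text: str, judges: list[Judge]) -> tuple[str, float]:
    """Majority vote across several LLM judges, plus an agreement score.

    High agreement suggests the trait is reasonably well-defined; low
    agreement hints that the vibe (here, humour) is too subjective to
    automate reliably.
    """
    votes = Counter(judge(text) for judge in judges)
    label, count = votes.most_common(1)[0]
    return label, count / len(judges)


# Stub judges standing in for real model calls.
judges = [lambda t: "FUNNY", lambda t: "NOT_FUNNY", lambda t: "FUNNY"]
label, agreement = cross_check("Why did the GPU cross the road?", judges)
print(label, f"agreement={agreement:.2f}")  # FUNNY agreement=0.67
```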
Dunlap highlighted the growing need for collaboration between computer scientists and experts in psychology and education, who offer valuable insights into human interaction and subjective task evaluation.
Interestingly, another insight is that while ‘vibes’ play an important role in user preferences, they are secondary to correctness: a model with an appealing style but incorrect answers is less useful than one that delivers accurate outputs, regardless of its tone. Rather than replacing traditional performance metrics, ‘vibes’ complement them, offering a more comprehensive understanding of model behaviour and its impact on user experience.
But what does the future hold?
The future of AI evaluation will likely combine data-driven metrics with human intuition. By focusing on user experience, developers can create methods that assess how well AI models align with human expectations and emotions.
Researchers like Dunlap agree that model evaluations will expand beyond numerical scores like those on MMLU (Massive Multitask Language Understanding) to include more subjective traits. The focus will shift from global, standardised evaluations to user-specific ones.
In a podcast, OpenAI’s Kevin Weil noted that today’s models are limited more by evaluation methods than by intelligence, and that better evals could unlock greater accuracy and broader task capabilities.
“I think the space of evaluation has grown a lot in the past few years, and there is a lot of money involved. I think what is more lacking is figuring out what to do with all these benchmarks,” said Dunlap when asked whether evals are underfunded, suggesting that limited budgets are better spent evaluating models on existing benchmarks than on creating new ones.
Regarding the practical utility of vibe-based benchmarks, Dunlap highlighted that they are most effective for open-ended tasks, such as asking a chatbot like ChatGPT to write a story or using LLMs for customer service.
Along similar lines, researchers at Reka AI introduced Vibe-Eval, an open benchmark designed to challenge models like GPT-4 and Claude 3 with nuanced, hard prompts that probe traits such as humour, tone, and conversational depth.
“Hard prompts are hard to make,” the Reka AI researchers write in their paper. An ideal hard benchmark prompt, they say, should be unsolvable by current frontier multimodal language models, interesting or useful to solve, and error-free and unambiguous for an evaluator.