LLM benchmarks are known to err, which undermines their reliability as a method for evaluating LLMs. Training data contamination, for instance, is a prominent issue with benchmarking, and benchmarks like GLUE, SQuAD and the Winograd Schema have seen models overperform when fed carefully crafted inputs. The developers building these models, however, still evaluate their work against such benchmarks on Hugging Face leaderboards in an attempt to rank at the top.
There are several reasons why these benchmarks are broken, which distorts the evaluation of AI models in general. One is that they are often too narrow in scope. Another is that they rarely reflect real-world usage: the datasets used to train LLMs often do not resemble the data the models will encounter in practice, which can produce models that score well on the benchmarks but perform poorly in real-world applications.
What are the problems with benchmarks?
MMLU (Massive Multitask Language Understanding) is considered the most extensive benchmark. It requires the model to answer immediately with a single character (A, B, C or D), which can be challenging for complex questions.
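To make this concrete, here is a minimal sketch of how such single-letter scoring is often implemented. It is illustrative only: the prompt template and the letter_logprobs scoring function are assumptions for this example, not the official MMLU harness.

```python
# Illustrative sketch of single-letter, answer-immediately scoring.
# `letter_logprobs` is a hypothetical callable that returns the model's
# log-probability for each candidate letter given the prompt.

PROMPT_TEMPLATE = (
    "{question}\n"
    "A. {a}\nB. {b}\nC. {c}\nD. {d}\n"
    "Answer:"
)

def grade_item(item, letter_logprobs):
    """Pick the most probable letter and compare it with the gold answer."""
    prompt = PROMPT_TEMPLATE.format(
        question=item["question"],
        a=item["choices"][0], b=item["choices"][1],
        c=item["choices"][2], d=item["choices"][3],
    )
    scores = letter_logprobs(prompt, ["A", "B", "C", "D"])  # e.g. {"A": -1.2, ...}
    prediction = max(scores, key=scores.get)
    return prediction == item["answer"]
```

Because the model must commit to a letter in a single step, any ambiguity or error in the question translates directly into a right-or-wrong score.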
A developer on YouTube explains how he found multiple errors in the MMLU test questions. “I was genuinely shocked; there were innumerable factual errors, and I would try to trace down the origin of each and see what the source said. The problem wasn’t just with one source; it was with quite a few of these sources.”
These errors inevitably impact the results; in some cases they could shift scores by up to 2%, a notable difference in benchmarking. He further explains that simply modifying the approach to let the model 'think' a bit before answering significantly improves performance. Taking the single most probable answer as the final answer isn't always the best approach; sampling multiple possible answers and selecting the most common one worked better.
He did this by crafting few-shot examples for some subjects to help the model understand the task, prompting it with a chain of thought, and checking multiple sampled answers before choosing the most common one. As a result, he obtained 88.4% on the MMLU benchmark, albeit unofficially, beating the 86.4% recorded by OpenAI. It was the other way around for Meta's LLaMA, whose score came out significantly lower than the one published in its paper.
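The 'think first, then vote' approach he describes resembles chain-of-thought prompting combined with self-consistency. Below is a minimal sketch under that assumption; the generate sampling function and the "Answer: X" output format are hypothetical, not his exact setup.

```python
import re
from collections import Counter

def self_consistent_answer(prompt, generate, n_samples=5):
    """Sample several chain-of-thought completions and majority-vote the final letter.

    `generate` is a hypothetical sampling function (prompt -> text); the regex
    assumes each completion ends with a line like "Answer: B".
    """
    votes = []
    for _ in range(n_samples):
        completion = generate(prompt + "\nLet's think step by step.")
        match = re.search(r"Answer:\s*([ABCD])", completion)
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else None
```

The voting step is what makes the evaluation less sensitive to a single unlucky sample, at the cost of several generations per question.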
This isn’t restricted to just MMLU. Sometime last year, the HellaSwag benchmark that analyses commonsense NLI (Natural Language Inference) was found to have errors in 36% of its rows.
HumanEval only measures whether programs generated from docstrings work correctly, and it covers a very limited set of capabilities. It consists of 164 original programming problems and is generally treated as a measure of a language model's ability in Python. However, if either the dataset or the LLM is contaminated, the entire model can be assessed incorrectly.
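For context, the functional check behind this kind of evaluation boils down to executing the generated code against the problem's unit tests. The sketch below is a simplified illustration, assuming the test code defines a check(candidate) function as in the public dataset; real harnesses sandbox and time-limit this step.

```python
def passes_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Run a generated solution against its unit tests in a scratch namespace.

    Simplified illustration only: no sandboxing, no timeouts. Assumes
    `test_code` defines a `check(candidate)` function that asserts on
    the implementation named `entry_point`.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)              # define the generated function
        exec(test_code, namespace)                   # define `check`
        namespace["check"](namespace[entry_point])   # run the assertions
    except Exception:
        return False
    return True
```

A pass/fail signal like this says nothing about whether the problems themselves leaked into the training data, which is exactly why contamination skews the score.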
Better Benchmarking
A team of researchers from Tsinghua University, Ohio State University, and UC Berkeley has introduced AgentBench, which is a multidimensional benchmark created to assess LLMs-as-Agents in a variety of settings. This is unlike most existing benchmarks that focus on a particular environment, which limits their ability to give a thorough assessment of LLMs across various application contexts.
Realising that different use cases need different benchmarks, companies now offer their own solutions for evaluating large language models.
Effective benchmarking is crucial for building AI models, as it directs researchers towards understanding what works and what doesn't. Instead of trying to claim 'generality', benchmarks should focus on providing insights about the language model.
“Benchmarking is not about winning a contest but more about surveying a landscape— the more we can re-frame, contextualise and appropriately scope these datasets, the more useful they will become as an informative dimension to better algorithmic development and alternative evaluation methods,” explains the paper.
A thread on Reddit discusses a number of ways to fix these issues.
Benchmarks should be seen as a way to compare how a model performs after it has been released, rather than as a goal in themselves, an attitude popularised by leaderboards on hosting platforms like Hugging Face and Kaggle. The problem with higher-ranking models is that they are often over-specialised to the benchmark's specific examples or questions, which doesn't guarantee they will perform well outside of the evaluation.