A recent research paper, ‘Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level’, introduces ‘Agent K v1.0’, the first version of an end-to-end autonomous agent designed to automate, optimise, and generalise across diverse data science tasks.
The paper claims that large language models (LLMs) can autonomously achieve a performance level comparable to Kaggle Grandmasters, an assertion that has sparked debate in the data science community.
Bojan Tunguz, a four-time Kaggle Grandmaster, took to LinkedIn to express his views on the paper.
“The claims of the “Kaggle Grandmaster Level Agent” are total unqualified BS.”
Not Quite a Grandmaster Yet
The researchers from Huawei Noah’s Ark, UCL’s AI Centre and Department of Computer Science, and the Technical University of Darmstadt reported that the agent was tested on a variety of Kaggle competitions.
The paper claims Agent K achieves a 92.5% success rate across diverse tasks, showcasing capabilities in tabular, computer vision, NLP, and multimodal domains.
The paper also reports that Agent K v1.0 ranked in the top 38% of human competitors, achieving a skill level they argue is on par with expert human users. With high scores in competitions, including a record of gold, silver, and bronze medals, the authors suggest that Agent K has reached the performance level of a Grandmaster.
Referring to this, a data scientist on X said, “If you’re in the top 38% of Kaggle participants, you won’t even be able to become an Expert, let alone a Grandmaster.”
Kaggle competitions demand not only advanced technical skills but also practical experience and a nuanced understanding of data science challenges. As highlighted by AIM earlier, even individuals with extensive theoretical knowledge often find these competitions challenging, signifying the gap between academic learning and practical application.
Calling the research title “misleading”, Santiago Valdarrama, a computer scientist, said on LinkedIn that many of the competitions used weren’t even real competitions and that the system relied on many manual, hardcoded steps from the authors to guide the model.
“The system is limited to only certain types of problems that fit the hardcoded guardrails,” he added.
Moreover, in the realm of tabular data analysis, traditional machine learning models like XGBoost continue to outperform more complex models, including LLMs. XGBoost excels in handling structured data efficiently and accurately, often surpassing deep learning techniques in this domain.
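For illustration, the snippet below is a minimal sketch, not drawn from the paper, of the kind of gradient-boosted baseline that typically anchors tabular Kaggle pipelines; the dataset and hyperparameters are arbitrary placeholders.

```python
# Illustrative sketch only (not from the Agent K paper): a basic XGBoost baseline
# on a small structured dataset. Dataset choice and hyperparameters are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load a simple tabular dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A handful of common hyperparameters; real competition pipelines tune these heavily.
model = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)

print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```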
AIM had earlier discussed the evolving role of data scientists in light of LLM advancements, suggesting that while LLMs can automate certain tasks, they cannot yet replace the comprehensive skill set required for high-level data science competitions.
Agent K Achieves Medals?
To ensure a fair analysis, the researchers compared Agent K’s results with those of 5,856 human competitors in the same competitions. They tracked gold, silver, and bronze medal acquisition rates, applying consistent medal criteria to both groups.
Adjustments accounted for competitions where medals weren’t officially awarded, allowing a direct comparison. The findings showed that Agent K earned medals more consistently than many Kaggle users, especially in bronze medal wins.
The paper claimed the agent outperformed human participants in 42% of cases for bronze medals, compared to 23% where humans had the edge. For gold medals, it won 14% of the time versus 6% for human competitors.
However, as Tunguz pointed out, this claim can be misleading. He argued that the paper’s results overstated the model’s ranking since merely performing well in a few contests is not equivalent to reaching the Grandmaster status.
Achieving a Grandmaster level on Kaggle requires consistent top-tier placements across multiple, highly competitive challenges, often demanding insights and adaptability that LLMs currently lack.
He also said that the main problem was that none of the tests were done on an active Kaggle competition. The vast majority of the datasets used were toy synthetic datasets for playground competitions.
Based solely on this, the LLM cannot be classified as Kaggle Grandmaster Level. “EVERY single accomplished Kaggler that I know shares these views,” he said.
The Paper Agrees Too!
Interestingly, the paper’s limitations section acknowledges that Agent K v1.0’s performance, while promising, does not entirely equate to human Grandmaster skills.
Its Elo-MMR score, the metric used to benchmark its skill level, placed it below the median of true Grandmasters. This is a notable gap, reflecting the challenges that LLMs still face in tasks that require advanced feature engineering, nuanced understanding, and high adaptability.
Future improvements aim to boost these scores by refining areas where the agent currently lags, potentially closing the gap with elite human competitors.
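For context on the metric, the snippet below is a simplified sketch of the classic two-player Elo update, purely to illustrate how skill-rating comparisons of this kind work; the paper itself uses the more elaborate Elo-MMR system designed for many-player contests, so this is not the authors’ method.

```python
# Simplified illustration of the classic Elo update (not the paper's Elo-MMR system).
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """Return the new rating after a game with outcome `score` (1 win, 0 loss, 0.5 draw)."""
    return rating + k * (score - expected_score(rating, opponent))


# Example: a lower-rated competitor beating a stronger one gains a large boost.
print(elo_update(1500, 1700, score=1.0))  # roughly 1524.3
```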
Furthermore, the paper suggests that Agent K v1.0’s success was partly due to specific optimisations and environmental feedback rather than a generalisable Grandmaster-level capability.
Ultimately, while Agent K demonstrates the potential of LLMs in competitive data science, achieving true Kaggle Grandmaster status autonomously remains out of reach for current AI technology. This sentiment is clearly echoed by data science professionals who emphasise the skill and adaptability required for consistent top-ranking Kaggle performances.