The Massachusetts Institute of Technology (MIT) has implemented an innovative technique called ‘test-time training’ (TTT) on a fine-tuned Llama 3 8B model, achieving a record 61.9% accuracy on the Abstraction and Reasoning Corpus (ARC) benchmark.
The current leader on this benchmark has scored 55%. MIT views this progress as a significant step towards achieving ‘human-like’ problem-solving skills in LLMs.
“Our TTT pipeline, combined with an existing method (BARC), achieves state-of-the-art results on the ARC public set and performs comparably to an average human,” said the researchers.
The ARC of Intelligence
François Chollet, creator of Keras and a former Google researcher, built the ARC-AGI benchmark and has said it is “the only AI benchmark that measures progress towards general intelligence”.
The ARC-AGI benchmark consists of novel problems designed to evaluate an LLM’s logical reasoning abilities. The model must solve a visual puzzle: it recognises the pattern from a set of input-output examples and then applies that understanding to the test questions.
It is expected to deliver a pixel-perfect, accurate output grid.
The inputs and outputs consist of a grid, each square of which can be one of ten colours. The grid can be any size from 1×1 to 30×30. This format ensures the test avoids any cultural or linguistic dependencies.
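To make the format concrete, an ARC task can be represented as lists of integer grids, with each integer from 0 to 9 standing for a colour. The miniature task below is hypothetical, invented purely for illustration, and uses a made-up ‘mirror the grid’ rule rather than a real ARC problem.

```python
# Hypothetical, simplified ARC-style task: each grid is a list of rows, and
# each cell holds an integer 0-9 representing one of ten colours.
# Assumed rule (for illustration only): the output mirrors the input left-to-right.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0], [0, 4, 4]], "output": [[0, 5, 5], [4, 4, 0]]},
    ],
    "test": [
        {"input": [[7, 0, 0], [0, 8, 0]]},  # the model must predict the output grid
    ],
}

# A correct solver would return [[0, 0, 7], [0, 8, 0]] for the test input.
```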
“If found, a solution to ARC-AGI would be more impactful than the discovery of the transformer. The solution would open up a new branch of technology,” mentioned the creators of the benchmark.
General-purpose models have struggled to score well on the ARC-AGI benchmark. OpenAI’s o1 Preview scored under 10%, while Anthropic’s Claude 3.5 Sonnet scored below 25%.
The current leader, MindsAI, achieved a record score of 55% on the test using a technique that fine-tunes the model at the time of testing. MindsAI remains at the top of the leaderboard despite MIT scoring 61.9%, because MIT’s result was not produced on the private ARC-AGI data set or within the 12-hour time limit required to challenge the top spot on the leaderboard.
MIT’s Dark Horse
While MindsAI fine-tuned its model during testing, MIT first fine-tuned the model’s parameters with low-rank adaptation (LoRA) on a publicly available ARC-AGI dataset, which contains input-output example tasks of the same kind as those in the test.
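The paper’s exact training script is not reproduced here, but a minimal sketch of LoRA fine-tuning of this kind, assuming the Hugging Face transformers and peft libraries and a hypothetical file of serialised ARC prompts, would look roughly as follows:

```python
# Minimal sketch of LoRA fine-tuning on ARC-style text data (assumptions:
# Hugging Face transformers + peft installed; tasks already serialised to
# prompt/answer strings in a hypothetical file such as arc_train.jsonl).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B"  # base model cited in the article
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA trains small low-rank adapter matrices instead of all 8B parameters.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here, a standard causal-LM training loop (or transformers.Trainer)
# runs over the serialised ARC examples.
```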
During fine-tuning, the data set is also augmented using a leave-one-out approach: one example is omitted from a task, the model learns from the rest, and it then predicts an output for the omitted example. This strengthens the model’s understanding of the ARC problem dataset.
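A leave-one-out split of this kind is straightforward to sketch; the helper below is hypothetical and only illustrates how each demonstration pair of a task becomes its own held-out training target:

```python
def leave_one_out_tasks(train_pairs):
    """Turn one ARC task's demonstration pairs into several synthetic tasks.

    For each pair i, the remaining pairs act as demonstrations and pair i
    becomes the held-out example whose output the model must predict.
    """
    synthetic_tasks = []
    for i, held_out in enumerate(train_pairs):
        context = train_pairs[:i] + train_pairs[i + 1:]
        synthetic_tasks.append({"train": context, "test": [held_out]})
    return synthetic_tasks

# With the hypothetical task shown earlier, leave_one_out_tasks(task["train"])
# yields two synthetic tasks, each holding out one demonstration pair.
```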
Next, TTT is introduced while the model solves a real test case. As in the usual ARC-AGI test, the model is presented with a set of input-output examples and an input that needs to be solved. The pipeline produces several transformations or variations of the examples and the test problem, varying grid size, dimension, colour or orientation.
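The paper defines its own set of augmentations; the sketch below is only an assumed, minimal illustration of the idea, pairing a few invertible grid operations (rotation, horizontal flip and a colour permutation) with their inverses so that predictions can later be mapped back:

```python
import random

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def permute_colours(grid, mapping):
    """Relabel colours according to a permutation of 0-9."""
    return [[mapping[c] for c in row] for row in grid]

# Each transformation is paired with its inverse so that a prediction made in
# the transformed space can be mapped back to the original grid.
colour_map = list(range(10))
random.shuffle(colour_map)
inverse_colour_map = [colour_map.index(c) for c in range(10)]

transforms = [
    (rotate90, lambda g: rotate90(rotate90(rotate90(g)))),  # inverse = rotate 270°
    (flip_horizontal, flip_horizontal),                     # self-inverse
    (lambda g: permute_colours(g, colour_map),
     lambda g: permute_colours(g, inverse_colour_map)),
]
```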
For each transformed version, the model predicts a set of outputs and ‘votes’ for a top prediction based on frequency. The top predictions from the different transformations are then mapped back to the original grid orientation and aggregated, and the most consistent answer is returned as the final output.
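Aggregation can be as simple as a majority vote over the inverse-transformed predictions. The sketch below assumes a hypothetical predict function that wraps the fine-tuned model, together with the transforms list from the previous sketch:

```python
from collections import Counter

def solve_with_voting(task, predict, transforms):
    """Run the model on several transformed views of a task and majority-vote.

    predict(train_pairs, test_input) is a hypothetical wrapper around the
    fine-tuned model that returns one predicted output grid.
    """
    candidates = []
    for forward, inverse in transforms:
        # Move the demonstrations and the test input into the transformed view.
        train_pairs = [
            {"input": forward(p["input"]), "output": forward(p["output"])}
            for p in task["train"]
        ]
        test_input = forward(task["test"][0]["input"])

        # Predict in the transformed space, then map the answer back.
        prediction = predict(train_pairs, test_input)
        candidates.append(inverse(prediction))

    # Self-consistency: the grid predicted most often wins.
    counts = Counter(tuple(map(tuple, grid)) for grid in candidates)
    best, _ = counts.most_common(1)[0]
    return [list(row) for row in best]
```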
“Aggregating across these transformations through voting procedures leads to significant improvements. This suggests that some tasks may be easier to solve in their transformed versions and that using self-consistency (voting) for aggregation is generally beneficial,” the authors further said.
The authors also performed the test by integrating TTT into an existing technique called Bayesian Abstract Reasoning Corpus (BARC), which had already achieved a high score on the ARC-AGI test. BARC provides a structured baseline for abstract reasoning by using Python programs to generate ARC-style tasks, optimising the model to perform well on the benchmark.
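As a toy, entirely hypothetical illustration of that generation step, a single Python ‘rule’ can be rendered into input-output pairs and serialised as an extra training task; a real programme-based pipeline would draw from a large library of such rules:

```python
import random

def generate_mirror_task(num_pairs=3, max_size=5):
    """Toy, hypothetical ARC-style task generator.

    The rule here is 'mirror the grid horizontally'; a programme-based pipeline
    would sample many such rules and render each into input-output pairs
    usable as additional fine-tuning data.
    """
    pairs = []
    for _ in range(num_pairs):
        height = random.randint(1, max_size)
        width = random.randint(1, max_size)
        grid = [[random.randint(0, 9) for _ in range(width)] for _ in range(height)]
        pairs.append({"input": grid, "output": [row[::-1] for row in grid]})
    return {"train": pairs[:-1], "test": [pairs[-1]]}
```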
Test Time Techniques to Fast-track AGI?
Optimising an AI model to perform well on a specific benchmark always raises concerns, chiefly whether such narrowly tuned models will be effective in real-world applications. However, it is worth noting that all general-purpose models were initially trained to solve a narrow set of problems and use cases.
As the underlying data corpus grows, the boundaries between specialised and general-purpose models tend to blur. Models designed for ARC, while presently optimised for a narrow benchmark, could eventually generalise their reasoning capabilities with exposure to more varied data sets and tasks.
“For LLMs, the priority seems to be defined by the number of occurrences and not by the actual ground truth about real-world facts. So, training with highly distilled data seems to be the best we can do for now. But how much of this distilled data do we need to counterweight all the possible wrong conclusions that an LLM might hallucinate?” questioned a user on Reddit, while talking about the future potential of models using ‘distilled’ data.
Despite the concerns, ARC-AGI is still a legitimate test. “While not perfect, ARC-AGI is still the only benchmark that was designed to resist memorisation—the very thing LLMs are superhuman at—and measure progress to close the gap between current AI and AGI,” mentioned the developers of the test.
An important takeaway is the promise of test time techniques. “Our findings suggest that test-time methods could play a pivotal role in advancing the next generation of LLMs,” said the authors.
Earlier, OpenAI used ‘test-time compute’ and ‘train-time compute’ to enhance the reasoning capabilities of the o1 Preview model by providing more computational resources during training and inference. “We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute),” said OpenAI in a blog post announcing o1.
“People underestimate how powerful test-time compute is: compute for longer, in parallel, or fork and branch arbitrarily—like cloning your mind 1,000 times and picking the best thoughts,” said Peter Welinder, VP of product at OpenAI.