It is Stupid to Ask How Many R’s ‘Strawberry’ Has

Whenever a new LLM is released, users tend to quiz it first with basic questions like: “How many R’s does ‘Strawberry’ have?” or “Which one is bigger – 9.9 or 9.11?”.

Most models, like GPT-3.5, Claude, and Llama, get the answer wrong. The problem starts when users try to benchmark the reasoning capabilities of a model based on these questions.

Steve Wilson, the CEO of Exabeam, explained that when an LLM processes a word, it doesn’t see it as individual letters but rather as tokens. These tokens may represent entire words or subword units, depending on the model’s design.

“For example, ‘strawberry’ might be broken into one or more tokens that don’t directly correspond to individual letters. This can lead to errors when trying to analyse or count specific letters,” he added.

Tasks that require multiple steps, such as identifying all instances of a specific letter and their positions, are more prone to errors because each step introduces the possibility of a mistake. If the model fails at one step, it affects the entire output.
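
To make this concrete, here is a rough sketch of what a model actually “sees” when ‘strawberry’ comes in, using the open-source tiktoken library as a stand-in for a production tokeniser; the exact split and token IDs vary from model to model, so the output here is illustrative only.

import tiktoken

# Inspect how one common tokeniser splits "strawberry"; the IDs are illustrative only.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)  # a short list of integer token IDs
print([enc.decode_single_token_bytes(t) for t in tokens])  # the subword chunks behind those IDs
# The model receives only the integer IDs, with no built-in notion of which letters each one contains.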

Edgar ter Danielyan, the director of Danielyan Consulting, said that LLMs process text as sets of interrelated numbers that exhibit certain regularities and patterns, without any understanding of what the words actually mean or what they might refer to in different circumstances.

“They do not form an internal representation of the world as humans do. In fact, LLMs do not even use words — all words in human languages are converted into numbers at a very early stage called tokenisation, and any meaning we attribute to words is quickly lost, as a word becomes just a set of numbers with very complex relationships with all the other sets of numbers,” he added, suggesting that LLMs cannot truly understand text.


It is not Hallucination; it is Tokenisation

LLMs can’t count letters directly because they process text in chunks called “tokens”. Even so, some may get the “strawberry” question right, thanks to training data rather than true understanding. For accurate letter counting, try breaking down the word or using external tools.
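
As a quick illustration of those two workarounds, the plain-Python sketch below spells the word out so each letter stands on its own, and counts the letter directly in code; the helper names are made up for this example.

def spell_out(word: str) -> str:
    # Separate every letter with a space so a tokeniser treats each character individually.
    return " ".join(word)

def count_letter(word: str, letter: str) -> int:
    # Count the letter in code instead of asking the model to do it.
    return word.lower().count(letter.lower())

print(spell_out("strawberry"))          # s t r a w b e r r y
print(count_letter("strawberry", "r"))  # 3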

A Reddit user explained that if you ask an LLM to count how many times the letter ‘r’ appears in the word ‘strawberry’, the LLM might see ‘strawberry’ as three tokens: 302, 1618, and 19772. It has no way of knowing that the third token (19772) contains two ‘r’s.

“Interestingly, some LLMs might get the ‘strawberry’ question right, not because they understand letter counting, but most likely because it’s such a commonly asked question that the correct answer (three) has infiltrated their training data. This highlights how LLMs can sometimes mimic understanding without truly grasping the underlying concept,” he added.

A research paper titled ‘Large Language Models Lack Understanding of Character Composition of Words’ explains this further, suggesting that LLMs’ struggle to understand character composition stems primarily from their fundamental reliance on token-level processing.

LLMs are primarily trained and operate at the token level, where words or subword units are treated as indivisible elements. This approach, while efficient for many natural language processing tasks, inherently overlooks the intricate relationships between individual characters within words.

Tokenisation methods used in LLMs, such as Byte-Pair Encoding (BPE) or WordPiece, focus on breaking text into larger units rather than individual characters.

As a result, LLMs lack a fine-grained understanding of how characters combine to form words, which is crucial for tasks involving character manipulation or analysis. This token-centric approach creates a significant gap between the model’s representation of language and the actual character-level structure of the text, leading to poor performance on tasks that require an understanding of character composition.
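
To show what Byte-Pair Encoding does in miniature, here is a toy sketch of the merge loop on a made-up three-word corpus; real tokenisers are trained on vastly larger data and work at the byte level, so this only illustrates how frequent letter pairs get fused into symbols that no longer expose individual characters.

from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of single characters, mapped to a frequency.
corpus = {tuple("strawberry"): 5, tuple("berry"): 8, tuple("straw"): 3}

for step in range(6):
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(f"merge {step + 1}: {best}")
# After a few merges, frequent chunks such as 'berry' become single symbols,
# and the letters inside them are no longer visible as separate units.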

Things are Slowly Changing with System 2 LLMs

When AIM asked similar reasoning questions, like counting characters in words, to the recently released o1-preview model, it gave correct answers to questions such as how many E’s the word “enterprise” has. Upon digging a bit deeper, we realised it used this Python code to arrive at the correct answer:

# Let's count the number of occurrences of the letter 'e' in the word "enterprise"
word = "enterprise"
count_e = word.count('e')
count_e

A Reddit user praised o1 models for using chain of thought to solve complex problems. He also pointed to OpenAI’s demonstration in which the o1 model was presented with an encoded message and asked to decode it. The model broke the problem down into steps, analysed patterns, and gradually worked towards a solution.

This process mimics human-like reasoning, where the model “thinks through” the problem before providing an answer.
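
For readers who want to try this themselves, here is a minimal sketch of a chain-of-thought style prompt for the letter-counting task, assuming the official OpenAI Python SDK; the model name and prompt wording are our own illustrative choices, not what OpenAI demonstrated.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Spell the word 'strawberry' one letter at a time, "
    "then count how many of those letters are 'r', "
    "and only then state the final answer."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; substitute any chat model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)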

However, o1 models still struggle with some basic maths questions. Theo Browne, a popular YouTuber and the founder and CEO of Ping Labs, recently posted a video on the reasoning capabilities of o1 models. It turned out that the models were unable to find all the possible corners of a parallelogram.
But we believe this is just the beginning, and in the coming months we will see models like GPT-5 (which we somewhat suspect is o1 itself) handle reasoning more efficiently.
