RLHF is NOT Really RL

Yann LeCun wholeheartedly agrees. OpenAI co-founder Andrej Karpathy recently expressed disappointment in Reinforcement Learning from Human Feedback (RLHF). “RLHF is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated,” he wrote in a post on X (@karpathy) on August 7, 2024.

He explained that Google DeepMind’s AlphaGo was trained using actual reinforcement learning (RL). The computer played games of Go and optimised its strategy based on rollouts that maximised the reward function (winning the game), eventually surpassing the best human players. “AlphaGo was not trained with reinforcement learning from human feedback (RLHF). If it had been, it likely would not have performed nearly as well,” said Karpathy.

However, Karpathy agrees that for tasks that are more open-ended, like summarising an article, answering tricky questions, or rewriting code, it’s much harder to define a clear goal or reward. In these cases, it’s not easy to tell the AI what a “win” looks like. Since there’s no simple way to evaluate these tasks, using RL in these scenarios is really challenging.

Not everyone aligns with Karpathy’s view. Pierluca D’Oro, a PhD student at Mila and researcher at Meta, who is building AI agents, argues that AlphaGo has a straightforward objective, to win the match. “Yes, without any doubt RL maximally shines when the reward is clearly defined. Winning at Go, that’s clearly defined! We don’t care about how the agent wins, as long as it satisfies the rules of the game,” D’Oro said.

He explained that as humans will interact with AI agents in the future, it is important for LLMs to be trained with human feedback. “AI agents are designed to benefit humans, who are not only diverse but also incredibly complex, beyond our full understanding,” he said. “For humans, it often comes from things like human common sense, expectations, or honor.”

Here, Karpathy also agrees. “RLHF is a net helpful step in building an LLM assistant,” he said, adding that LLM assistants benefit from the generator-discriminator gap. “It is significantly easier for a human labeller to select the best option from a few candidate answers than to write the ideal answer from scratch,” he explained, citing an example such as ‘generate a poem about paperclips’.

An average human labeller might struggle to create a good poem from scratch as an SFT example, but they can more easily select a well-written poem from a set of candidates.

Karpathy goes on to explain that using RLHF in complex tasks like Go wouldn’t work well because the feedback (“vibe check”) is a poor substitute for the actual goal. The process can lead to misleading outcomes and models that exploit flaws in the reward system, resulting in nonsensical or adversarial behavior.

Unlike true RL, where the reward is clear and directly tied to success, RLHF relies on subjective human judgments, making it less reliable for optimising model performance, he says.

“This is a bad take. When interacting with humans, giving answers that humans like *is* the true objective,” responded Natasha Jaques, senior research scientist at Google AI, to Karpathy’s critique.

She says that while human feedback is limited compared to something like infinite game simulations (e.g., in AlphaGo), this doesn’t make RLHF less valuable. Instead, she suggests that the challenge is greater but also potentially more impactful because it could help reduce biases in language models, which has significant societal benefits.

“Posting this is just going to discourage people from working on RLHF, when it’s currently the only viable way to mitigate possibly severe harms due to LLM biases and hallucinations,” she replied to Karpathy.

Moving Away from RLHF

Yann LeCun from Meta AI has consistently argued that the trial-and-error method of RL is a risky way to develop intelligence. For example, a baby does not learn to identify objects by looking at a million samples of the same object, or by trying dangerous things and learning from them, but by observing, predicting, and interacting with them, even without supervision.

Meta has been bullish on self-supervised learning for quite some time. The approach, however, is best suited to large corporations like Meta, which possess the terabytes of data needed to train state-of-the-art models.

On the other hand, OpenAI recently introduced Rule-Based Rewards (RBRs), a method designed to align models with safe behaviour without extensive human data collection.

According to OpenAI, while reinforcement learning from human feedback (RLHF) has traditionally been used, RBRs are now a key component of their safety stack. RBRs use clear, simple, and step-by-step rules to assess whether a model’s outputs meet safety standards.

When integrated into the standard RLHF pipeline, RBRs help balance helpfulness with harm prevention, ensuring the model behaves safely and effectively without the need for recurrent human inputs.
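To make the idea concrete, here is a toy sketch of how rule-based scores could be added to a learned reward during RL training. It is not OpenAI's implementation: the rule names, the string checks, the weights, and the functions `rule_based_reward` and `total_reward` are all hypothetical, and a real system would likely grade each rule with a model rather than with substring matching.

```python
from typing import Callable, List, Tuple

# Each rule is (name, check, weight). All three fields are illustrative assumptions.
Rule = Tuple[str, Callable[[str], bool], float]

# Hypothetical rules describing a desirable refusal: polite, clear, non-judgmental.
RULES: List[Rule] = [
    ("contains_apology", lambda text: "sorry" in text.lower(), 1.0),
    ("states_inability", lambda text: "can't help" in text.lower(), 1.0),
    ("is_judgmental", lambda text: "you should be ashamed" in text.lower(), -2.0),
]


def rule_based_reward(response: str, rules: List[Rule] = RULES) -> float:
    """Sum the weights of every rule whose check fires on the response."""
    return sum(weight for _, check, weight in rules if check(response))


def total_reward(response: str, reward_model_score: float) -> float:
    """Combine a learned reward-model score with the rule-based score."""
    return reward_model_score + rule_based_reward(response)


polite_refusal = "I'm sorry, but I can't help with that."
harsh_refusal = "You should be ashamed for asking."
```

In this sketch the polite refusal collects both positive rules while the harsh one is penalised, so the combined reward steers the policy toward the preferred refusal style without a human labelling either response.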

Similarly, Anthropic developed Constitutional AI, an approach to training AI systems, particularly language models, using a predefined set of principles or a “constitution” rather than relying heavily on human feedback.

Meanwhile, Google DeepMind, known for its paper “Reward is Enough”, which argues that intelligence can be achieved through reward maximisation, recently introduced another paper detailing Foundational Large Autorater Models (FLAMe).

FLAMe is designed to handle various quality assessment tasks and address the growing challenges and costs associated with the human evaluation of LLM outputs.

Meta, which recently released Llama 3.1, opts for self-supervised learning rather than RLHF. For the post-training phase of Llama 3.1, Meta employed SFT on instruction-tuning data along with Direct Preference Optimisation (DPO).

DPO fine-tunes the model directly on human preference pairs (a preferred and a rejected response to the same prompt) using a simple classification-style loss, rather than training a separate reward model and then running a reinforcement learning loop.
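The DPO objective for a single preference pair can be sketched in a few lines. This is a minimal illustration, not Meta's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions, and the inputs stand for summed log-probabilities of each response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))."""
    # How much more (or less) likely the policy makes each response than the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model, so no explicit reward model or rollout sampling is needed during training.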

Meta isn’t stopping there either. It recently published another paper titled “Self-Taught Evaluators,” which proposes building a strong generalist evaluator for model-based assessment of LLM outputs. This method generates synthetic preferences over pairs of responses without relying on human annotations.

Another paper from Meta titled “Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge” allows LLMs to improve by judging their own responses instead of relying on human labellers.

In line with this, Google DeepMind has also proposed an algorithm called reinforced self-training (ReST) for language modelling. It similarly reduces the need for humans in the loop: starting from an initial policy, the language model generates its own training data, which is then filtered and used to improve the policy offline. While ReST can be applied across generative learning settings, the paper focuses on machine translation.
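As described in the ReST paper, the method alternates between a Grow step, which samples outputs from the current policy, and an Improve step, which keeps the samples scoring above a reward threshold and fine-tunes on them. The toy sketch below mimics that loop under heavy simplification: the "policy" is just a probability table over three made-up outputs, "fine-tuning" is refitting that table to the kept samples, and the reward values, thresholds, and function names are all hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

Policy = Dict[str, float]  # toy policy: candidate output -> probability


def grow(policy: Policy, n_samples: int, rng: random.Random) -> List[str]:
    """Grow step: sample a dataset of outputs from the current policy."""
    outputs = list(policy)
    weights = [policy[o] for o in outputs]
    return rng.choices(outputs, weights=weights, k=n_samples)


def improve(
    samples: List[str],
    reward: Callable[[str], float],
    threshold: float,
    policy: Policy,
) -> Policy:
    """Improve step: keep samples at or above the threshold and refit the policy."""
    kept = [s for s in samples if reward(s) >= threshold]
    if not kept:
        return policy  # nothing passed the filter, so leave the policy unchanged
    counts = Counter(kept)
    return {output: count / len(kept) for output, count in counts.items()}


# Made-up rewards standing in for a learned or automatic quality score.
rewards = {"bad translation": 0.1, "okay translation": 0.5, "good translation": 0.9}
policy: Policy = {output: 1 / 3 for output in rewards}
rng = random.Random(0)

# Raise the filtering threshold on each pass, as the policy gets better.
for threshold in (0.4, 0.8):
    samples = grow(policy, 200, rng)
    policy = improve(samples, lambda s: rewards[s], threshold, policy)
```

Each pass shifts probability mass toward higher-reward outputs using only the model's own samples and a scoring function, which is the sense in which human labellers are taken out of the loop.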

The post RLHF is NOT Really RL appeared first on AIM.
