I went down a very deep rabbit hole, so I am going to split this writing into two parts. Part 1 here will focus on how RL can augment models to learn over time even if we don’t have complete information on the context or the human. In Part 2, I will discuss how we can remember things more dynamically, beyond just the good old RAG.
A while back I poked at the idea of LLMs only having “second hand knowledge”. They feel like a super intelligent child forever trapped in a box, able to learn only from stories, books and parents’ advice. They make such an impression on us because we finally have a chance to talk to an entity with such a vast and well connected knowledge of the world around us.
It provides a valuable initial model of the world, bootstrapping it culturally, but it doesn’t replace the value of first hand personal experiences and evolving insights over time. This is important for experiences that we hope to tailor to individuals - since no one truly is an average of human tendencies, we need to address the domain drift that comes with each individual’s interactions.
⚠️ The Stateless Problem
What stops an LLM from developing these evolving characteristics through every interaction with a human?
It’s because the LLM is stateless.
Statelessness means each interaction is treated independently - the model does not inherently retain or update any internal state between interactions. LLMs are essentially massive feed-forward neural networks that predict the next token based entirely on the current prompt alone. They don’t have memory built into the architecture beyond whatever information fits into the limited context window.
It is context dependent but not context retentive. It only looks at the explicit context window at runtime and resets to a neutral baseline after every session, without persistent memory.
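To make that concrete, here is a minimal sketch of what statelessness looks like in practice. `llm_complete()` is a hypothetical stand-in for any chat-completion style API - the only “memory” the model ever sees is whatever we explicitly pack back into the prompt ourselves.

```python
from typing import Dict, List

def llm_complete(messages: List[Dict[str, str]]) -> str:
    """Hypothetical stand-in for any chat-completion style API call."""
    return f"(reply conditioned on the {len(messages)} messages it was just given)"

history: List[Dict[str, str]] = []

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = llm_complete(history)            # the model only "knows" what we resend each time
    history.append({"role": "assistant", "content": reply})
    return reply

chat("I'm trying to eat healthier.")
print(chat("What should I snack on?"))       # coherent only because `history` was resent

# A fresh call with no history: the model has no idea about the earlier preference.
print(llm_complete([{"role": "user", "content": "What should I snack on?"}]))
```

Everything that looks like memory in a chat product is this kind of scaffolding around the model, not the model itself.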
Without the ability to draw on the unique past interactions it has had with an individual, the model cannot…
learn user preferences over time
adapt to changing environments
improve its performance based on feedback
build on past successes or learn from mistakes
This basically means any more sophisticated, personalised interaction that stays aligned with a human over time is severely limited.
🔄 Augmenting models
To overcome this, there are several ways we can augment an LLM to work around its statelessness.
The two augmentations that I am interested in are #1: how do we incorporate reinforcement learning into the loop? and #2: how do “retrieval-enhanced transformers”, like DeepSeek’s Latent Attention Model, enable a more human-like behaviour of fetching information as the model goes about generating tokens?
Let’s tackle #1 this week…
LLMs come pre-trained with immense generalisable knowledge, which is great - we don’t have to start from scratch. But even with RLHF, as we know, the model is trained once and deployed. LLMs are mostly static - they don’t learn from ongoing interactions beyond the immediate conversation.
However, humans are complicated and the contexts they are in are often multi-dimensional. The famous RL successes, such as AlphaZero, are built on clear goals to align and optimise against; it’s a different story when the system needs to be flexible to different human circumstances. The “credit assignment problem” in reinforcement learning gets super spicy - because you’re no longer dealing with neat, bounded rules or clear feedback loops like a board game. When the environment is open-ended, with incomplete information and an enormous parameter space, assigning credit is hard.
Imagine trying to tell someone it’s important to have a meaningful dialogue - what does that actually mean? How would you go about optimising for it?
There are several strategies researchers may lean on, and I found inverse reinforcement learning and intrinsic motivation particularly interesting.
Inverse reinforcement learning
Inverse reinforcement learning (IRL) lets the system reverse-engineer an implicit reward function from demonstrations or human behaviour. It’s great for capturing subtle or tacit knowledge and is often leveraged in robotics. In Stanford’s “Apprenticeship Learning” work (Abbeel & Ng, 2004), they had robots watch expert helicopter pilots and inferred the subtle implicit rewards for stable, precise flight. Eventually their robots could demonstrate expert-level precision.
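To ground the idea, here is a heavily simplified sketch of the feature-matching flavour of IRL on a toy five-state chain MDP. The environment, the “expert”, and the one-line weight update are all invented for illustration and are far simpler than what Abbeel & Ng actually did: the learner watches an expert that always moves right and infers linear reward weights under which that behaviour looks optimal.

```python
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9

# Deterministic chain MDP: action 0 steps left, action 1 steps right.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, n_states - 1)] = 1.0

phi = np.eye(n_states)  # one-hot state features; reward assumed linear, r(s) = w . phi(s)

def value_iteration(r, iters=200):
    """Return a greedy deterministic policy for the reward vector r."""
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = r[None, :] + gamma * (P @ V)    # Q[a, s]
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

def feature_expectations(policy, start=0, horizon=60):
    """Discounted feature counts from rolling the policy out from a start state."""
    mu, s = np.zeros(n_states), start
    for t in range(horizon):
        mu += (gamma ** t) * phi[s]
        s = int(np.argmax(P[policy[s], s]))  # deterministic next state
    return mu

# The "expert" always moves right - it implicitly values reaching the last state.
expert_policy = np.ones(n_states, dtype=int)
mu_expert = feature_expectations(expert_policy)

# Recover reward weights under which the expert's behaviour looks optimal.
w = np.zeros(n_states)
for _ in range(20):
    policy = value_iteration(phi @ w)        # learner's best response to the current reward guess
    mu = feature_expectations(policy)
    if np.allclose(mu, mu_expert):
        break                                # learner already matches the expert under w
    w = mu_expert - mu                       # nudge reward toward states the expert visits more

print("inferred reward weights:", np.round(w, 2))
print("learner imitates expert:", np.array_equal(policy, expert_policy))
```

The key move is the last line of the loop: the reward guess gets pushed toward the states the expert visits more than the current learner does, until imitating the expert becomes the learner’s own best response.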
This is great for convenience - the system can observe recurring patterns and extrapolate sometimes-unconscious goals or intentions. However, relying on that alone may not always lead humans to a better place. It assumes that a human’s existing actions always reflect their true underlying goals or intentions. For example, you may always reach for junk food - it’s easy to interpret junk food as a good reward for the model to reinforce, but in reality the human actually wants to be healthier.
We need something to balance it out…
Intrinsic Motivation
Intrinsic motivation allows the system to occasionally challenge or test existing patterns and explore new actions that aren’t part of your habitual behaviours. It lets agents reward themselves for discovering novel or surprising things - almost like a child - helping them gather new and valuable information.
IRL alone might notice you often choose Netflix and snacks over an after-dinner walk, concluding you prefer comfort and indulgence. But instead of only reinforcing the current patterns, intrinsic motivation encourages the system to ask “could there be something else you’d value more if you tried it?” The system might occasionally suggest a walk to see if you respond positively - gently testing alternatives to learn deeper, unexpressed goals.
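As a toy sketch of that balancing act - the activities, the feedback numbers and the bonus weight are all made up for illustration - a simple count-based novelty bonus on top of learned preferences is enough to keep rarely-tried options in play:

```python
import math
import random
from collections import defaultdict

activities = ["netflix", "snacks", "evening walk"]
value = defaultdict(float)   # learned estimate of how much the user enjoys each suggestion
count = defaultdict(int)     # how many times each activity has been suggested
beta = 1.0                   # strength of the curiosity bonus

def user_feedback(activity):
    """Stand-in for real feedback: the walk is secretly valued, just never chosen habitually."""
    base = {"netflix": 0.6, "snacks": 0.55, "evening walk": 0.9}[activity]
    return base + random.gauss(0, 0.1)

def suggest():
    # Exploit learned preferences, plus an intrinsic bonus that decays with familiarity.
    def score(a):
        return value[a] + beta / math.sqrt(count[a] + 1)
    return max(activities, key=score)

for _ in range(200):
    a = suggest()
    r = user_feedback(a)
    count[a] += 1
    value[a] += (r - value[a]) / count[a]   # incremental running mean

print({a: round(value[a], 2) for a in activities})
print("most suggested overall:", max(count, key=count.get))
```

Pure exploitation (beta = 0) would lock onto Netflix forever; the novelty term forces the occasional “what about a walk?” and lets the system discover a preference the user never expressed.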
In the end, we are trying to balance a design tension: helping humans by respecting their existing preferences, whilst having enough curiosity and courage to gently challenge the human’s comfort zone.