▲Q-learning is not yet scalableseohong.me

187 points by jxmorris12 16 hours ago | 36 comments

marcosdumay 46 minutes ago [-]

I guess it's worth pointing that when humans learn the long-horizon tasks that we learn by repetitive training, we segment them in tasks with a shorter horizon and compose them hierarchically later.

isaacimagine 1 hours ago [-]

No mention of Decision Transformers or Trajectory Transformers? Both are offline approaches that tend to do very well at long-horizon tasks, as they bypass the credit assignment problem by virtue of having an attention mechanism.

Most RL researchers consider these approaches not to be "real RL", as they can't assign credit outside the context window, and therefore can't learn infinite-horizon tasks. With 1m+ context windows, perhaps this is less of an issue in practice? Curious to hear thoughts.

DT: https://arxiv.org/abs/2106.01345

TT: https://arxiv.org/abs/2106.02039

itkovian_ 7 hours ago [-]

Completely agree and think it’s a great summary. To summarize very succinctly; you’re chasing a moving target where the target changes based on how you move. There’s no ground truth to zero in on in value-based RL. You minimise a difference in which both sides of the equation have your APPROXIMATION in them.

I don’t think it’s hopeless though, I actually think RL is very close to working because what it lacked this whole time was a reliable world model/forward dynamics function (because then you don’t have to explore, you can plan). And now we’ve got that.

lalaland1125 13 hours ago [-]

This blog post is unfortunately missing what I consider the bigger reason why Q learning is not scalable:

As horizon increases, the number of possible states (usually) increases exponentially. This means you require exponentially increasing data to have a hope of training a Q that can handle those states.

This is less of an issue for on policy learning, because only near policy states are important, and on policy learning explicitly only samples those states. So even though there are exponential possible states your training data is laser focused on the important ones.

elchananHaas 11 hours ago [-]

I think the article's analysis of overapproximation bias is correct. The issue is that due to the Max operator in the Q learning noise is amplified over timesteps. Some methods to reduce this bias, such as https://arxiv.org/abs/1509.06461 were successful in improving the RL agents performance. Studies have found that this happens even more for the states that the network hasn't visited many times.

An exponential number of states only matters if there is no pattern to them. If there is some structure that the network can learn then it can perform well. This is a strength of deep learning, not a weakness. The trick is getting the right training objective, which the article claims q learning isn't.

I do wonder if MuZero and other model based RL systems are the solution to the author's concerns. MuZero can reanalyze prior trajectories to improve training efficiency. The Monte Carlo Tree Search (MCTS) is a principled way to perform horizon reduction by unrolling the model multiple steps. The max operator in MCTS could cause similar issues but the search progressing deeper counteracts this.

5 hours ago [-]

Ericson2314 12 hours ago [-]

https://news.ycombinator.com/item?id=44280505 I think that thead might help?

Total layman here, but maybe some tasks are "uniform" despite being "deep" in such a way that poor samples still suffice? I would call those "ergodic" tasks. But surely there are other tasks where this is not the case?

lalaland1125 12 hours ago [-]

Good clarification. I have edited my post accordingly.

There are situations where states increase at much slower rates than exponential.

Those situations are a good fit for Q learning.

jhrmnn 10 hours ago [-]

Is this essentially the same difference as between vanilla regular-grid and importance-sampling Monte Carlo integration?

arthurcolle 12 hours ago [-]

Feel the Majorana-1

whatshisface 15 hours ago [-]

The benefit of off-policy learning is fundamentally limited by the fact that data from ineffective early exploration isn't that useful for improving on later more refined policies. It's clear if you think of a few examples: chess blunders, spasmodic movement, or failing to solve a puzzle. This becomes especially clear once you realize that data only becomes off-policy when it describes something the policy would not do. I think the solution to this problem is (unfortunately) related to the need for better generalization / sample efficiency.

getnormality 14 hours ago [-]

Doesn't this claim prove too much? What about the cited dog that walked in 20 minutes with off-policy learning? Or are you making a more nuanced point?

briandw 12 hours ago [-]

This papers assumes that you know quite a bit about RL already. If you really want to dig into RL, this intro course from David Silver (Deep Mind) is excellent: https://youtu.be/2pWv7GOvuf0?si=CmFJHNnNqraL5i0s

ArtRichards 11 hours ago [-]

Thank you for this link.

abeppu 2 hours ago [-]

The thing that worked for them was reducing the horizon. From my limited and dated understanding, I thought that's what the gamma term was for: exponentially you discount the value of stuff in the future to the point of being negligible (or less than the epsilon of differences you can even represent). So ... why/when is exponential discounting not enough?

4 hours ago [-]

paraschopra 9 hours ago [-]

Humans actually do both. We learn from on-policy by exploring consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (but difference being we can tell good behaviors from bad, and learn from a filtered list of what we consider as good behaviors). In most, off-policy RL, a lot of behaviors are bad and yet they get into the training set and hence leading to slower training.

taneq 5 hours ago [-]

> difference being we can tell good behaviors from bad

Not always! That's what makes some expert demonstrations so fascinating, watching someone do something "completely wrong" (according to novice level 'best practice') and achieve superior results. Of course, sometimes this just means that you can get away with using that kind of technique (or making that kind of blunder) if you're just that good.

Straw 2 hours ago [-]

This post mischaracterizes AlphaZero/MuZero.

AlphaZero/MuZero are not model based, and aren't on policy either. They train at a significantly higher temperature, and thus intentional suboptimal play, than they use when playing to win. LeelaChessZero has further improvements to reduce the bias on training on suboptimal play.

There's a well known tradeoff in TD learning based on how many steps ahead you look- 1-step TD converges off policy, but can give you total nonsense/high bias when your Q function isn't trained. Many-step can't give you nonsense because it scores based on the real result, but that real result came from suboptimal play so your own off-policyness biases the results, plus it's higher variance. It's not hard to adjust between these two in AlphaZero training as you progress to minimize overall bias/variance. (As in, AlphaZero can easily do this- I'm not saying the tuning of the schedule of how to do it is easy!)

bee_rider 51 minutes ago [-]

It is the weekend, so let’s anthropomorphize.

The idea of sub-optimal play to increase learning is interesting. We can notice the human social phenomenon of being annoyed at players who play games “too mechanically” or boringly, and admiration of players (even in an overly studied game like chess) with an “intuitive” style.

I wonder how AI training strategies would change if the number of games they were allowed to play was fairly limited, to the handful of thousands of matches of a game that a person might play over the course of their lives. And perhaps if their “rank” was evaluated over the course of their training, like it is for humans.

pixelpoet 2 hours ago [-]

Probably a dumb/obvious question: if the bias comes from selecting the Q-maximum, can't this simply be replaced by sampling from a PDF?

abeppu 1 hours ago [-]

I think this means you're not longer optimizing for the right thing. My understanding is that max is important because your value of taking action A from state S includes the presumption that you'll pick the best available action under the policy from each successive state you visit. If you swap in some PDF, you're asking what's the value of taking action A if I then act by random sampling from that distribution thereafter.

s-mon 13 hours ago [-]

While I like the blogpost, I think the use of unexplained acronyms undermines the opportunity of this blogpost to be useful to the wider audience. Small nit: make sure acronyms and jargon is explained.

levocardia 54 minutes ago [-]

It's clearly written for an audience of other RL researchers, given than the conclusion is "will someone please come up with Q-learning methods that scale!"

keremk 6 hours ago [-]

For these kinds of blogposts where the content is very good but not very approachable due to assumption of extensive prior knowledge, I find using an AI tool very useful, to explain and simplify. Just used the new browser Dia for this, and it worked really well for me. Or you can use your favorite model provider and copy and paste. This way the post stays concise, and yet you can use your AI tools to ask questions and clarify.

7 hours ago [-]

anonthrowawy 3 hours ago [-]

i actually think thats what made it crisp

andy_xor_andrew 13 hours ago [-]

the magic thing about off-policy techniques such as Q-Learning is that they will converge on an optimal result even if they only ever see sub-optimal training data.

For example, you can use a dataset of chess games from agents that move totally randomly (with no strategy at all) and use that as an input for Q-Learning, and it will still converge on an optimal policy (albeit more slowly than if you had more high-quality inputs)

Ericson2314 12 hours ago [-]

I would think this being true is the definition of the task being "ergodic" (distorting that term slightly, maybe). But I would also expect non-ergodic tasks to exist.

AndrewKemendo 14 hours ago [-]

Q learning isn’t scalable because of the stream barrier, however streaming DRL (TD-Lambda) is scalable:

https://arxiv.org/abs/2410.14606

Note that this is from Turing award winner Richard Sutton’s lab at UofA

RL works

getnormality 13 hours ago [-]

But does this address scaling to long-horizon tasks, which is what the article is about?

AndrewKemendo 2 hours ago [-]

Yes because it’s management of long term reward by default

andy_xor_andrew 13 hours ago [-]

The article mentions AlphaGo/Mu/Zero was not based on Q-Learning - I'm no expert but I thought AlphaGo was based on DeepMind's "Deep Q-Learning"? Is that not right?

energy123 13 hours ago [-]

DeepMind's earlier success with Atari was based on offline Q-Learning

GistNoesis 7 hours ago [-]

The stated problem is getting Off-policy RL to work, aka discover a policy smarter than the one it was shown in its dataset.

If I understand correctly, they show random play, and expect perfect play to emerge from the naive Q-learning training objective.

In layman's term, they expect the algorithm to observe random smashing of keys on a piano, and produce a full-fledge symphony.

The main reason it doesn't work is because it's fundamentally some Out Of Distribution training.

Neural networks works best in interpolation mode. When you get into Out Of Distribution mode, aka extrapolation mode, you rely on some additional regularization.

One such regularization you can add, is to trying to predict the next observations, and build an internal model whose features help make the decision for the next action. Other regularization may be to unroll in your head multiple actions in a row and use the prediction as a training signal. But all these strategies are no longer the domain of the "model-free" RL they are trying to do.

Other regularization, can be making the decision function more smooth, often by reducing the number of parameters (which goes against the idea of scaling).

The adage is "no plan survive first contact with the enemy". There needs to be some form of exploration. You must somehow learn about the areas of the environment where you need to operate. Without interaction from the environment, one way to do this is to "grok" a simple model of the environment (fitting perfectly all observation (by searching for it) so as to build a perfect simulator), and learn on-policy from this simulation.

Alternatively if you have already some not so bad demonstrations in your training dataset, you can get it to work a little better than the policy of the dataset, and that's why it seems promising but it's really not because it's just relying of all the various facets of the complexity already present in the dataset.

If you allow some iterative gathering phase of information from the environment, interlaced with some off-policy training, it's the well known domain of Bayesian methods to allow efficient exploration of the space like "kriging", "gaussian process regression", multi-arm bandits and "energy-based modeling", which allow you to trade more compute for sample efficiency.

The principle being you try model what you know and don't know about the environment. There is a trade-off between the uncertainty that you have because you have not explored the area of space yet and the uncertainty because the model don't fit the observation perfectly yet. You force yourself to explore unknown area so as not to have regrets (Thomson Sampling) ) but still sample promising regions of the space.

In contrast to on-policy learning, the "bayesian exploration learning" learn in an off-policy fashion all possible policies. Your robot doesn't only learn to go from A to B in the fastest way. Instead it explicitly tries to learn various locomotion policies, like trotting or galloping, and other gaits and use them to go from A to B, but spend more time perfecting galloping as it seems that galloping is faster than trotting.

Possibly you can also learn adaptive strategy like they do in sim-to-real experiments where your learned policy is based on unknown parameters like how much weight your robot carry, and your learned policy will estimate on-the-fly these unknown parameters to become more robust (aka filling in the missing parameters to let the optimal "Model Predictive Control" work).

mrfox321 3 hours ago [-]

Read the paper.

They control for the data being in-distribution

Their dataset also has examples of the problem being solved.

toootooo 6 hours ago [-]

How can we eliminate Q-learning’s bias in long-horizon, off-policy tasks?....

Onavo 10 hours ago [-]

Q learning is great as a hello world RL project for teaching undergraduates.

msgodel 41 minutes ago [-]

I feel like the null TD algorithm is much better if you want a "hello world."

Loading comments...

marcosdumay 46 minutes ago [-]

I guess it's worth pointing that when humans learn the long-horizon tasks that we learn by repetitive training, we segment them in tasks with a shorter horizon and compose them hierarchically later.

isaacimagine 1 hours ago [-]

DT: https://arxiv.org/abs/2106.01345

TT: https://arxiv.org/abs/2106.02039

itkovian_ 7 hours ago [-]

lalaland1125 13 hours ago [-]

This blog post is unfortunately missing what I consider the bigger reason why Q learning is not scalable:

elchananHaas 11 hours ago [-]

5 hours ago [-]

Ericson2314 12 hours ago [-]

https://news.ycombinator.com/item?id=44280505 I think that thead might help?

lalaland1125 12 hours ago [-]

Good clarification. I have edited my post accordingly.

There are situations where states increase at much slower rates than exponential.

Those situations are a good fit for Q learning.

jhrmnn 10 hours ago [-]

Is this essentially the same difference as between vanilla regular-grid and importance-sampling Monte Carlo integration?

arthurcolle 12 hours ago [-]

Feel the Majorana-1

whatshisface 15 hours ago [-]

getnormality 14 hours ago [-]

Doesn't this claim prove too much? What about the cited dog that walked in 20 minutes with off-policy learning? Or are you making a more nuanced point?

briandw 12 hours ago [-]

ArtRichards 11 hours ago [-]

Thank you for this link.

abeppu 2 hours ago [-]

4 hours ago [-]

paraschopra 9 hours ago [-]

taneq 5 hours ago [-]

> difference being we can tell good behaviors from bad

Straw 2 hours ago [-]

This post mischaracterizes AlphaZero/MuZero.

bee_rider 51 minutes ago [-]

It is the weekend, so let’s anthropomorphize.

pixelpoet 2 hours ago [-]

Probably a dumb/obvious question: if the bias comes from selecting the Q-maximum, can't this simply be replaced by sampling from a PDF?

abeppu 1 hours ago [-]

s-mon 13 hours ago [-]

levocardia 54 minutes ago [-]

It's clearly written for an audience of other RL researchers, given than the conclusion is "will someone please come up with Q-learning methods that scale!"

keremk 6 hours ago [-]

7 hours ago [-]

anonthrowawy 3 hours ago [-]

i actually think thats what made it crisp

andy_xor_andrew 13 hours ago [-]

the magic thing about off-policy techniques such as Q-Learning is that they will converge on an optimal result even if they only ever see sub-optimal training data.

Ericson2314 12 hours ago [-]

I would think this being true is the definition of the task being "ergodic" (distorting that term slightly, maybe). But I would also expect non-ergodic tasks to exist.

AndrewKemendo 14 hours ago [-]

Q learning isn’t scalable because of the stream barrier, however streaming DRL (TD-Lambda) is scalable:

https://arxiv.org/abs/2410.14606

Note that this is from Turing award winner Richard Sutton’s lab at UofA

RL works

getnormality 13 hours ago [-]

But does this address scaling to long-horizon tasks, which is what the article is about?

AndrewKemendo 2 hours ago [-]

Yes because it’s management of long term reward by default

andy_xor_andrew 13 hours ago [-]

The article mentions AlphaGo/Mu/Zero was not based on Q-Learning - I'm no expert but I thought AlphaGo was based on DeepMind's "Deep Q-Learning"? Is that not right?

energy123 13 hours ago [-]

DeepMind's earlier success with Atari was based on offline Q-Learning

GistNoesis 7 hours ago [-]

The stated problem is getting Off-policy RL to work, aka discover a policy smarter than the one it was shown in its dataset.

If I understand correctly, they show random play, and expect perfect play to emerge from the naive Q-learning training objective.

In layman's term, they expect the algorithm to observe random smashing of keys on a piano, and produce a full-fledge symphony.

The main reason it doesn't work is because it's fundamentally some Out Of Distribution training.

Neural networks works best in interpolation mode. When you get into Out Of Distribution mode, aka extrapolation mode, you rely on some additional regularization.

Other regularization, can be making the decision function more smooth, often by reducing the number of parameters (which goes against the idea of scaling).

mrfox321 3 hours ago [-]

Read the paper.

They control for the data being in-distribution

Their dataset also has examples of the problem being solved.

toootooo 6 hours ago [-]

How can we eliminate Q-learning’s bias in long-horizon, off-policy tasks?....

Onavo 10 hours ago [-]

Q learning is great as a hello world RL project for teaching undergraduates.

msgodel 41 minutes ago [-]

I feel like the null TD algorithm is much better if you want a "hello world."