Does Reinforcement Learning (RL) Scale? A Hacker News Question
Scaling Q-Learning to Complex Tasks: Algorithmic Innovations to Address Bias Accumulation Through Horizon Reduction
2 min read · Jun 15, 2025
Summary of “Q-learning is not yet scalable” by Seohong Park:
1. Scalability of RL vs. Other Objectives:
- While next-token prediction, diffusion models, and contrastive learning scale well with data and model size, reinforcement learning (RL) struggles to scale similarly, particularly for long-horizon tasks.
- Most RL successes (e.g., AlphaGo, OpenAI Five, RLHF for LLMs) rely on on-policy methods (PPO, REINFORCE), which require fresh data and are impractical for real-world applications like robotics.
2. The Promise and Limitations of Off-Policy RL (Q-learning):
- Off-policy RL (e.g., Q-learning) can reuse past data, making it more sample-efficient.
- However, Q-learning does not scale well to long-horizon problems due to accumulating biases in TD (temporal difference) targets, unlike other scalable objectives (e.g., supervised learning); see the TD-target sketch after this list.
3. Empirical Evidence:
- In controlled experiments with complex, long-horizon tasks (e.g., robotic manipulation, puzzle-solving), standard off-policy RL methods (IQL, SAC+BC) failed to improve significantly even with massive datasets (1B samples).
- Horizon reduction techniques (n-step returns, hierarchical RL) helped mitigate bias accumulation, but only by a constant factor rather than fundamentally solving scalability; see the n-step sketch after this list.
4. Call for Research:
- Need for algorithmic breakthroughs in off-policy RL to handle long-horizon tasks.
- Potential directions:
- Hierarchical RL with recursive structures (like chain-of-thought in LLMs).
- Model-based RL, combining scalable model learning with on-policy RL.
- Alternative RL formulations (e.g., quasimetric RL, contrastive RL) that avoid TD learning.
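To make the bias-accumulation point in item 2 concrete, here is a minimal, illustrative Python sketch. It is not taken from the post or the paper; the function names and the constant per-backup error are assumptions for illustration. It shows the one-step Q-learning target and a toy calculation of how a small error injected at every bootstrap compounds across the horizon.

```python
def one_step_td_target(reward, next_q_max, gamma=0.99, done=False):
    """One-step TD target: r + gamma * max_a' Q(s', a').
    Because the target bootstraps from the current Q estimate,
    any error in next_q_max is copied straight into the target."""
    return reward + (0.0 if done else gamma * next_q_max)


def propagated_bias(horizon, per_step_bias, gamma=0.99):
    """Toy model (assumption): if every backup injects `per_step_bias`
    of error, the error reaching the initial state after `horizon`
    chained backups is roughly the discounted sum of those errors."""
    return per_step_bias * sum(gamma ** t for t in range(horizon))


for h in (10, 100, 1000):
    print(f"horizon={h:5d}  accumulated bias ~ {propagated_bias(h, 0.01):.3f}")
```

With a discount close to 1, the accumulated error grows almost linearly with the horizon, which is consistent with the post's observation that long-horizon tasks are where Q-learning breaks down.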
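The horizon-reduction idea in item 3 can be sketched the same way. The snippet below is again an assumed, minimal illustration, not the paper's implementation: an n-step target sums n observed rewards before bootstrapping, so covering a task of length H takes roughly H/n backups instead of H, i.e., bias is injected a constant factor less often.

```python
def n_step_td_target(rewards, bootstrap_q, gamma=0.99):
    """n-step TD target: sum_{t=0}^{n-1} gamma^t * r_t + gamma^n * Q(s_n, a').
    Bootstrapping only once per n environment steps means errors are
    injected ~H/n times over a horizon of H, instead of ~H times."""
    n = len(rewards)
    discounted_return = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return discounted_return + (gamma ** n) * bootstrap_q


# Example: a 5-step target with a (hypothetical) bootstrapped value of 2.0
print(n_step_td_target(rewards=[0.0, 0.0, 1.0, 0.0, 0.5], bootstrap_q=2.0))
```

This is exactly the "constant factor" caveat in the summary: increasing n shrinks the number of bootstraps but does not remove the dependence on horizon altogether.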
Conclusion:
Current Q-learning methods do not yet scale like other ML objectives, but addressing bias accumulation and horizon challenges could unlock RL’s potential for real-world applications (robotics, agents, etc.). The post encourages further research into scalable off-policy RL algorithms.