
Is Reinforcement Learning (RL) Scalable? A Hacker News Question

Scaling Q-Learning to Complex Tasks: Algorithmic Innovations to Address Bias Accumulation Through Horizon Reduction

2 min read · Jun 15, 2025
https://news.ycombinator.com/item?id=44279850
https://seohong.me/blog/q-learning-is-not-yet-scalable/

Summary of “Q-learning is not yet scalable” by Seohong Park:

1. Scalability of RL vs. Other Objectives:

  • While next-token prediction, diffusion models, and contrastive learning scale well with data and model size, reinforcement learning (RL) struggles to scale similarly, particularly for long-horizon tasks.
  • Most RL successes (e.g., AlphaGo, OpenAI Five, RLHF for LLMs) rely on on-policy methods (PPO, REINFORCE), which require fresh data and are impractical for real-world applications like robotics.

2. The Promise and Limitations of Off-Policy RL (Q-learning):

  • Off-policy RL (e.g., Q-learning) can reuse past data, making it more sample-efficient.
  • However, Q-learning does not scale well to long-horizon problems: unlike supervised learning and other scalable objectives, its TD (temporal difference) targets accumulate bias over the horizon (a minimal sketch follows below).
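
To make the bias-accumulation point concrete, here is a minimal tabular sketch of the one-step TD target that Q-learning bootstraps on. The toy sizes and names (n_states, q_update, etc.) are illustrative assumptions, not code from the post or the linked repo.

```python
import numpy as np

# Hypothetical toy problem sizes, for illustration only.
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))

def td_target(reward, next_state, Q, gamma=0.99):
    # One-step TD target: bootstrap on the current Q estimate at the next state.
    # Any bias in Q[next_state] flows directly into the target; over a horizon
    # of H bootstrapped steps, such errors can compound roughly H times.
    return reward + gamma * np.max(Q[next_state])

def q_update(Q, state, action, reward, next_state, lr=0.1, gamma=0.99):
    # Standard tabular Q-learning backup toward the bootstrapped target.
    target = td_target(reward, next_state, Q, gamma)
    Q[state, action] += lr * (target - Q[state, action])
    return Q
```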

3. Empirical Evidence:

  • In controlled experiments with complex, long-horizon tasks (e.g., robotic manipulation, puzzle-solving), standard off-policy RL methods (IQL, SAC+BC) failed to improve significantly even with massive datasets (1B samples).
  • Horizon reduction techniques (n-step returns, hierarchical RL) mitigated bias accumulation, but only by a constant factor rather than fundamentally solving scalability (see the n-step sketch below).
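
To see why such tricks reduce bias only by a constant factor, below is a hedged sketch of an n-step TD target: n real rewards are summed before a single bootstrap, so a length-H trajectory needs roughly H/n bootstrapped backups instead of H. The function name and signature are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def n_step_target(rewards, next_state, Q, gamma=0.99, n=5):
    # Discounted sum of the n observed rewards, then a single bootstrap on Q.
    # A length-H trajectory now needs about H/n bootstrapped backups instead
    # of H, so bias accumulates a constant factor less, but the dependence on
    # the horizon itself does not go away.
    target = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    target += (gamma ** n) * np.max(Q[next_state])
    return target
```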

4. Call for Research:

  • Need for algorithmic breakthroughs in off-policy RL to handle long-horizon tasks.
  • Potential directions:
    • Hierarchical RL with recursive structures (like chain-of-thought in LLMs); a rough sketch follows after this list.
    • Model-based RL, combining scalable model learning with on-policy RL.
    • Alternative RL formulations (e.g., quasimetric RL, contrastive RL) that avoid TD learning.
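
As a back-of-the-envelope illustration of the hierarchical direction: splitting a horizon-H task into subgoals of length k means each level only backs up over TD chains of length about k and H/k. The helper below is purely hypothetical arithmetic, not code from the post or the repo.

```python
def effective_horizons(H, k):
    # Hypothetical two-level split: a high-level policy picks a subgoal every
    # k steps, and a low-level controller bridges the k steps in between.
    high_level = -(-H // k)  # number of subgoal decisions, i.e. ceil(H / k)
    low_level = k            # steps each low-level TD chain must cover
    return high_level, low_level

# e.g. a 1,000-step task with 32-step subgoals: 32 high-level decisions and
# 32-step low-level chains, instead of one 1,000-step TD chain.
print(effective_horizons(1000, 32))  # -> (32, 32)
```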

Conclusion:

Current Q-learning methods do not yet scale like other ML objectives, but addressing bias accumulation and horizon challenges could unlock RL’s potential for real-world applications (robotics, agents, etc.). The post encourages further research into scalable off-policy RL algorithms.

https://github.com/seohongpark/horizon-reduction

Written by noailabs

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
