W37 Papers

September 8, 2025 — Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
RL hierarchy reasoning

Under GRPO, training shows a two-phase dynamic: the model first fixes procedural and execution mistakes (syntax, arithmetic, step fidelity), after which the bottleneck shifts to strategic planning (choosing the right approach).
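
For context on the training signal, here is a minimal sketch of the group-relative advantage that GRPO is generally described as using: several completions are sampled for one prompt, and each reward is normalized against the group's mean and standard deviation. This is not taken from the paper's code. The function name, the `eps` value, the use of the population standard deviation, and the toy rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. `eps` (an assumed value) guards against division by
    zero when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four completions, two judged correct (reward 1.0), two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat the group average get a positive advantage and the rest get a negative one, so the update depends only on relative quality within the group. If every completion receives the same reward, all advantages are zero and the group contributes no gradient signal.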