W37 Papers
- Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
RL
hierarchy
reasoning
Under GRPO, training shows a two-phase dynamic: first the model fixes procedural/execution mistakes (syntax, arithmetic, step fidelity); then the bottleneck shifts to strategic planning (choosing the right approach).
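For context on the GRPO signal these notes refer to, here is a minimal sketch of the group-relative advantage as it is commonly described: each sampled answer's reward is normalized against the other samples for the same prompt. The function name, the `eps` term, the use of the population standard deviation, and the 0/1 rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical 0/1 correctness rewards for four sampled answers to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every answer is correct carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers get a positive advantage and incorrect ones a negative advantage of equal magnitude, while a group with identical rewards yields zeros.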