Skip to content

W37 Papers

W37 Papers

Under GRPO, training shows a two-phase dynamic—first the model fixes procedural/execution mistakes (syntax, arithmetic, step fidelity); then the bottleneck shifts to strategic planning (choosing the right approach).