R-Zero: Self-Evolving Reasoning LLM from Zero Data

(arxiv.org)

3 points | by Anon84 3 days ago

1 comment

vineethy 3 days ago
Interesting twist on automated curriculum learning. This paper uses an LLM for both the environment and the policy, whereas other papers use LLMs for the policy/value function. It would be cool to see other reward strategies tying all these threads together.
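Here is a toy sketch of the split the comment describes. One component plays the environment by proposing tasks, and another plays the policy by answering them. Everything in it is hypothetical and is not taken from the paper. `toy_solver` is a deterministic stand-in for the policy LLM. `uncertainty_reward` is an assumed environment-side reward that peaks when the policy's sampled answers split evenly, which favors tasks at the edge of the policy's ability. The skill-update rule is also made up.

```python
from collections import Counter


def uncertainty_reward(answers: list[str]) -> float:
    """Hypothetical environment-side reward: 1.0 when the policy's answers
    split evenly between two options, 0.0 when they all agree."""
    if not answers:
        return 0.0
    majority_count = Counter(answers).most_common(1)[0][1]
    agreement = majority_count / len(answers)
    return 1.0 - 2.0 * abs(agreement - 0.5)


def toy_solver(difficulty: float, skill: float, n: int = 8) -> list[str]:
    """Deterministic stand-in for the policy LLM (hypothetical): the share of
    correct answers falls linearly as task difficulty exceeds the skill."""
    p_correct = max(0.0, min(1.0, 1.0 - (difficulty - skill)))
    n_correct = round(p_correct * n)
    return ["right"] * n_correct + ["wrong"] * (n - n_correct)


def pick_task(candidates: list[float], skill: float) -> float:
    """Environment step: propose the candidate task with the highest reward
    (ties broken toward the harder task by tuple comparison)."""
    return max((uncertainty_reward(toy_solver(d, skill)), d) for d in candidates)[1]


def run_curriculum(steps: int = 5) -> list[float]:
    skill = 0.0
    candidates = [i / 10 for i in range(31)]  # difficulties 0.0 .. 3.0
    history = []
    for _ in range(steps):
        d = pick_task(candidates, skill)
        history.append(d)
        # Hypothetical update: training on the chosen task moves skill halfway toward it.
        skill += 0.5 * (d - skill)
    return history


print(run_curriculum())
```

The proposed difficulty climbs as the stand-in policy improves, which is the automated-curriculum effect. A real system would replace both stand-ins with LLMs and update them with RL instead of a fixed rule. Swapping `uncertainty_reward` for a different function is where other reward strategies would plug in.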