this post was submitted on 19 Jan 2024

Singularity


Everything pertaining to the technological singularity and related topics, e.g. AI, human enhancement, etc.

This is an automated archive.

The original was posted on /r/singularity by /u/Upasunda on 2024-01-19 07:53:27+00:00.


We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction-following ability improve, but so does the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
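To make the loop concrete, here is a minimal Python sketch of one self-rewarding iteration as the abstract describes it: the same model generates candidate responses, scores them via an LLM-as-a-Judge prompt, and the best/worst pairs feed a DPO update. Everything here is an assumption for illustration — `generate`, `judge_score`, and `dpo_finetune` are placeholders, not the paper's code, and the 0-5 scoring is a simplification of the paper's judging rubric.

```python
# Sketch of Self-Rewarding Language Model training (hypothetical placeholders,
# not the paper's implementation).
import random
from typing import List, Tuple

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for sampling a completion from the current model."""
    return f"<model response to: {prompt[:40]}...>"

def judge_score(instruction: str, response: str) -> float:
    """Ask the *same* model to rate its own response (LLM-as-a-Judge)."""
    judge_prompt = (
        "Review the response below and score it from 0 to 5 for how well it "
        f"answers the instruction.\n\nInstruction: {instruction}\n"
        f"Response: {response}\n\nScore:"
    )
    _ = generate(judge_prompt, temperature=0.0)
    return random.uniform(0, 5)  # placeholder: parse the model's score in practice

def build_preference_pairs(prompts: List[str], n_samples: int = 4) -> List[Tuple[str, str, str]]:
    """Sample candidates, self-score them, keep (prompt, chosen, rejected) pairs."""
    pairs = []
    for p in prompts:
        candidates = [generate(p) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: judge_score(p, c))
        pairs.append((p, ranked[-1], ranked[0]))  # highest-scored vs. lowest-scored
    return pairs

def dpo_finetune(pairs: List[Tuple[str, str, str]]) -> None:
    """Placeholder for a DPO training step on the self-generated preference pairs."""
    print(f"DPO update on {len(pairs)} preference pairs")

# Iterative DPO: each round trains on pairs the model judged itself (M1 -> M2 -> M3).
prompts = ["Explain what a reward model is.", "Summarise the singularity debate."]
for iteration in range(3):
    dpo_finetune(build_preference_pairs(prompts))
```

The point of the structure is that both the policy and the reward signal come from the same weights, so improving the model on one axis can improve it on the other across iterations.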

It seems to me that this is truly a huge leap, if true, especially in combination with synthetic data and self-instruction. Acceleration will be the only option.
