We design model-based evaluators and use them both to evaluate digital agents and to autonomously refine their performance. Experiments show that domain-general automated evaluators can significantly improve agent performance without any additional supervision.
Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr
UC Berkeley, University of Michigan