Model/Data associated with research project Autonomous Evaluation and Refinement of Digital Agents.

Paper | Code

We design and use model-based evaluators to both evaluate and autonomously refine the performance of digital agents. Experiments show that domain-general automated evaluators can significantly improve the performance of digital agents, without any extra supervision.

Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr

UC Berkeley, University of Michigan