I'm definitely not on the side that AI models can outright replace humans for a large variety of economic tasks. However, one issue I have with this study's approach is that instead of using agents integrated with various tools, they just used direct computer-use agents, where all the agent gets is some file-manipulation tools plus the ability to move the mouse, type on the keyboard, and take screenshots of the screen. As a result, they're not really testing agents in the kind of harnesses they'd actually be deployed in to perform these jobs, nor in the harnesses they've mostly been RL'd to be effective in. So it feels more like they're testing the competence of computer-use RL, not the models' own competence across various economic sectors.
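To make that distinction concrete, here's a minimal sketch of the two harness styles. All the names are made up for illustration; nothing below comes from the study or from any real agent framework, and the model's decisions are stubbed out:

  # Hypothetical sketch: a "computer use" agent only sees pixels and
  # emits raw input events, while a tool-integrated agent calls
  # structured tools in the software directly.
  from dataclasses import dataclass
  from typing import Union

  @dataclass
  class Click:      # low-level input event aimed at the GUI
      x: int
      y: int

  @dataclass
  class ToolCall:   # structured call into the application itself
      name: str
      args: dict

  Action = Union[Click, ToolCall]

  def computer_use_step(screenshot_png: bytes) -> Action:
      # Roughly what the study's harness amounts to: infer GUI state
      # from pixels, then hunt for buttons with the mouse. (Stubbed.)
      return Click(x=412, y=230)

  def tool_integrated_step(task: str) -> Action:
      # The kind of harness the comment argues for: emit a structured
      # tool call and skip the GUI entirely. (Stubbed.)
      return ToolCall(name="export_mesh", args={"path": "model.obj"})

  if __name__ == "__main__":
      print(computer_use_step(b""))
      print(tool_integrated_step("export the 3D model as OBJ"))

The point of the contrast: in the first style, every bit of domain competence has to be expressed through pixel-level GUI navigation, which is exactly the skill the study ends up measuring.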
For instance, I'm not at all surprised that models would struggle to produce a viable 3D model using nothing but the mouse, the keyboard, and screenshots. But then, on the other hand, we have stuff like this: https://www.trylad.com/
The study looks at a wide range of tests spanning many different areas of expertise and output types. Some of the tests, like the web-vis tasks, used Sonnet rather than Opus (which was not out at the time). It's like testing a car on many different things, where only one of the tests is actually driving somewhere and many of the others are based on the fabric used in the interior. This gives a very broad "96% failure" while overlooking the successes. Of course AI can't do everything, and nor can I.
One of the most interesting observations about AI is the timescale on which the favorite model and favorite task change. Before November I found Sonnet interesting, but not moving the needle that much. Once Opus came out, it was clear the needle was not only moving, but moving fast.
It matches my experience using AI to develop software. It's a super useful tool, but also really crap at doing anything outside of its training data. There is zero real understanding or thinking going on behind the curtain.