Are AI agents ready for the workplace? A new benchmark raises doubts

Recent research suggests that leading AI models struggle with white-collar tasks, raising doubts about their readiness for the workplace.

New York: Nearly two years ago, Microsoft CEO Satya Nadella predicted that AI would replace white-collar work such as that done by lawyers and accountants. So far, AI has changed these jobs very little.

Today's most capable AI models can research and plan, but they have yet to transform white-collar work. A new study from Mercor helps explain why. The study measures how well leading AI models perform real tasks drawn from consulting, banking, and law. Mercor built a benchmark called APEX-Agents for this purpose, and the results show that every AI lab is failing the test. Even the best models answered fewer than a quarter of the questions correctly, and in many cases they produced no correct answer at all.

Brendan Foody, CEO of Mercor, said the main problem was gathering information from different sources. In real work, people rely on many tools, such as Slack and Google Drive, to complete a single task. Many AI models struggle to reason across sources in this way.

The study used questions written by professionals who work in these fields. One example from the law section was especially tricky: it asked whether a company could treat certain data exports as permissible under privacy laws. The answer is yes, but reaching it requires careful analysis of both company policies and the relevant laws. If AI could answer questions like this reliably, it might one day take over some of the work lawyers do.

OpenAI has developed its own test of professional skills, but APEX-Agents takes a different approach. It focuses on carrying out tasks in specific, high-value professions, which makes it harder. Some models did better than others, with Gemini 3 Flash earning the highest score at 24%. However, none is ready to replace bankers just yet.

Although the scores are low, the AI field often improves quickly. Now that APEX-Agents is public, labs can try to beat it. Foody believes major improvements will come soon. He said, “It’s like an intern that gets it right a quarter of the time, but last year it only got it right 5 to 10% of the time.”

Image Credits and Reference: https://techcrunch.com/2026/01/22/are-ai-agents-ready-for-the-workplace-a-new-benchmark-raises-doubts/