The most important hiring question of the next decade is one no enterprise is asking yet and one no platform is built to answer.

The economic unit of work is moving from a person to a person operating a stack of AI agents. By the end of 2027, on every meaningful project at every meaningful company, the productive human will not be writing the code, the report, the analysis, or the brief. They will be orchestrating a fleet of agents that does that work, and verifying, governing, and shipping what those agents produce.
This is not a forecast. It is already true at the leading edge of software engineering, legal research, financial analysis, and customer support. Within twenty-four months, it will be true everywhere.
Which raises a question every Chief Human Resources Officer at a Fortune 500 company will be asked, by their CEO, in 2026 or 2027:
Which of our employees and which of our candidates can actually wield agents productively, safely, and at scale?
No platform answers this question today. Not Eightfold. Not Phenom. Not HireVue. Not Workday. Not Microsoft Viva. Not the agent-evaluation tools that have emerged for developers: Braintrust, Langfuse, Galileo. Each of these tools solves a real problem. None of them solves this one.
This post is about the gap. We call it Human + Agent Workforce Evaluation. It is a new category, and the enterprise that masters it first, and the platform that enterprises adopt to do it, will define how an entire generation of hiring decisions is made.
The enterprise hiring stack today was designed for a workforce that no longer exists.
Applicant tracking systems track candidates through stages; they were never built to evaluate them. Talent intelligence platforms like Eightfold map yesterday's skills graph onto tomorrow's roles, but the roles of tomorrow are not described by yesterday's skills. A candidate's history of "Python" or "SQL" tells us nothing about whether they can decompose a complex problem across five agents, catch a hallucinated tool call before it ships to production, or stay within a token budget while doing it.
Video interview platforms like HireVue commoditise human evaluation. They were built when the question was "can this person do the work themselves?" The question has changed. The new question is "can this person direct a stack of agents to do the work, and catch the agents when they're wrong?"
Agent observability tools (Braintrust, Langfuse, Galileo, OpenAI's evaluation framework) are excellent at evaluating agents as products. They tell engineering teams whether their agent passed its evals. But they treat agents as systems, not as worker-tools. A CHRO cannot use Braintrust to make a hiring decision. The output is technical, not workforce-oriented; there is no link to compensation bands, career paths, internal mobility, or compliance documentation.
And Microsoft Viva and SAP SuccessFactors, the workforce platforms enterprises already own, are HR systems of record. They store who works where, who reports to whom, and what training they have completed. They do not measure capability against agent-mediated work.
Three mature markets, none of them built to answer the question that matters most.
If the question is "can this person wield agents productively, safely, and at scale?", the answer has to be measured against real work, with real agents, in a sandbox that resembles the actual job. Self-reported skills do not survive contact with reality. Multiple-choice tests fail at the first hallucination. Resume screens are not even on the relevant axis.
In practice, the capability splits cleanly along four dimensions.
Orchestration. Can the candidate decompose a goal into the right set of agent calls? Choose the right tool for the right step? Recover gracefully when an agent fails or returns the wrong shape of output? Orchestration is the difference between an operator and a prompt-tourist. Most candidates today fall into the second category and don't know it. The signal is invisible until they are tested in a live multi-agent environment.
Verification. Can the candidate catch a hallucinated agent output, a fabricated citation, a tool call that looks right but is wrong? This is the single highest-stakes capability in regulated work. A pharma QA engineer who cannot detect when an agent has fabricated a regulatory clause is a liability the FDA will eventually find. A risk analyst who cannot spot a hallucinated counterparty is a liability the regulator will find faster. Verification cannot be assessed by asking; it can only be tested by injecting deliberate failures into a live session and measuring the response.
Decomposition. Can the candidate break down a multi-agent task with appropriate scope, parallelism, and hand-offs between agents? Decomposition predicts the ability to run the kind of one-person workflow that is now possible with modern agentic frameworks, the workflow where one operator coordinates work that previously required a team of ten.
Economics and Governance. Does the candidate know which model to use for which step: a cheap model for a routine sub-task, an expensive model for a hard one? Do they manage token costs as they manage their own time? Do they avoid leaking sensitive data into prompts? Do they preserve an audit trail when they should? In regulated industries, this is not optional. It is the difference between a tool an enterprise can deploy and a tool that triggers a compliance event.
These four dimensions (orchestration, verification, decomposition, governance) are not aspirational. They are testable. With the right environment, the right rubrics, and the right deliberate failure injections, they are measurable to the millisecond and defensible against regulatory review.
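To make "testable" concrete, here is a minimal sketch of how one verification check might be instrumented in a sandboxed session. It is illustrative only: the tool, the fabricated payload, and the field names are placeholders invented for this post, not a description of any production environment. The shape is what matters: a scripted tool call either returns the genuine result or a deliberately injected fabrication, and the environment records whether the operator flagged it and how quickly.

```python
# Illustrative sketch only: placeholder tool, payloads, and field names.
import random
import time

def lookup_clause(clause_id: str) -> str:
    """Stand-in for a real agent tool call, e.g. a document-retrieval step."""
    return f"Clause {clause_id}: retention period is 10 years."

def scripted_lookup(clause_id: str, inject: bool) -> tuple[str, bool]:
    """Return either the genuine result or a deliberately fabricated one."""
    if inject:
        return f"Clause {clause_id}: retention period is 2 years.", True  # fabricated
    return lookup_clause(clause_id), False

def run_verification_step(candidate_flags_output, inject_rate: float = 0.3) -> dict:
    """Run one scripted step and record detection and response time."""
    start = time.monotonic()
    output, was_injected = scripted_lookup("GxP-7.2", inject=random.random() < inject_rate)
    flagged = candidate_flags_output(output)  # the candidate's verification action
    return {
        "injected": was_injected,
        "flagged": flagged,
        "detected": was_injected and flagged,
        "false_alarm": flagged and not was_injected,
        "response_ms": round((time.monotonic() - start) * 1000),
    }

if __name__ == "__main__":
    # Example: a naive stand-in for the candidate, flagging any short retention period.
    result = run_verification_step(lambda text: "2 years" in text)
    print(result)
```

Aggregate records like these across orchestration, decomposition, and governance steps, apply a rubric, and the four dimensions become a score that a hiring committee and an auditor can both read.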
The gap will close eventually for every industry. It will close first for the regulated ones: pharmaceuticals, financial services, healthcare, aviation, energy. Three forces converge there.
The first is regulatory pressure. Pharmaceutical companies operate under GAMP 5 and 21 CFR Part 11. Banks operate under Basel III and a thicket of national regulators. Healthcare under HIPAA and NABH. Insurance under Solvency II. The European Union's AI Act takes effect in stages through 2026 and 2027. The Indian DPDP Act was enacted in 2023. The ISO 42001 standard for AI management systems is being adopted across enterprises that need a defensible posture. Every one of these frameworks asks the same question: can you document, audit, and explain the decisions your AI systems make? That question is currently asked of the systems. It will soon be asked of the humans operating them. An auditor who finds that a regulated firm cannot demonstrate which employees are qualified to operate which agents will write that finding into the report. The first such finding will set the precedent. The second will set the standard.
The second force is operational stakes. In a regulated industry, a mis-hire is not a productivity cost; it is a compliance event. A pharma QA engineer hired without verification capability is a future FDA observation. A risk analyst at a bank without verification capability is a future regulator letter. The cost of one mis-hire in these industries is measured in millions of dollars, not in onboarding overhead.
The third force is competitive timing. Regulated industries move slowly until they don't. When the regulator publishes guidance, every firm in the sector adopts the same standard within a year. The first firm to operationalise a defensible framework for evaluating its agent-augmented workforce, the first to be able to say "yes, every employee who deploys an agent in production has been certified against a standard that we can document for you", wins the regulator's confidence. The rest follow.
This is why we believe the category will form first in pharmaceuticals and financial services. Not because the technology is unique to those sectors, but because the cost of not solving the problem is highest there.
The enterprises that adopt Human + Agent Workforce Evaluation early will need to do three things differently.
First, they will need to move evaluation upstream. Today, agent capability is something most enterprises pretend to assess via interview questions or self-reported portfolios. Tomorrow, they will need a live evaluation environment as a default step in hiring, a sandbox where the candidate operates real agents on real tasks, with real failure injections, and produces a deterministic, audit-grade record. This is closer in form to a pilot's flight simulator than to a coding interview, and it will eventually feel as obvious in regulated hiring as the simulator feels in aviation.
Second, they will need to extend the evaluation framework from candidates to their existing workforce. The same framework that screens new hires is the framework that certifies the people already on payroll. Internal mobility, promotion, reskilling: each becomes a question of measured agent fluency, not self-assessment.
Third, they will need to make the framework auditable. Self-attested certifications are not defensible. The evaluation must produce a tamper-resistant record of what was tested, what model versions were used, what failures were injected, what the candidate detected, and how the score was computed. Every regulator will eventually ask for this record. Every acquirer will eventually ask for it during due diligence. Every board will eventually ask whether the firm's most critical roles are held by people who can actually do the work the firm has assumed they can do.
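One way to make that record tamper-resistant, sketched below purely as an illustration rather than as a statement of how any particular platform does it, is to hash-chain the evaluation log: each entry carries the hash of the entry before it, so altering any field after the fact invalidates every entry that follows. The field names (model_version, failures_injected, detections) are assumptions about what such a record might contain.

```python
# Illustrative sketch only: a hash-chained evaluation log. Field names are
# assumptions about what an audit-grade record might contain.
import hashlib
import json

def append_entry(chain: list[dict], entry: dict) -> None:
    """Append an entry whose hash covers its content plus the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev_hash, **entry}, sort_keys=True)
    chain.append({**entry, "prev": prev_hash,
                  "hash": hashlib.sha256(payload.encode()).hexdigest()})

def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry invalidates the chain."""
    prev_hash = "0" * 64
    for record in chain:
        entry = {k: v for k, v in record.items() if k not in ("prev", "hash")}
        payload = json.dumps({"prev": prev_hash, **entry}, sort_keys=True)
        expected = hashlib.sha256(payload.encode()).hexdigest()
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True

if __name__ == "__main__":
    chain: list[dict] = []
    append_entry(chain, {
        "task": "regulatory-clause-review",
        "model_version": "model-x-2026-01",   # hypothetical model identifier
        "failures_injected": 2,
        "detections": 2,
        "score": 0.87,
    })
    print(verify_chain(chain))  # True; editing any field afterwards makes this False
```

A chain like this answers the auditor's questions in order: what was tested, with which model versions, which failures were injected, what was detected, and how the score was computed.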
None of this is hypothetical. The frameworks for it are being built right now, in 2026, by the small number of teams that have understood that the category exists.
Exterview is an agentic talent intelligence platform built natively on the Microsoft AI Cloud. We run the full hiring lifecycle for regulated enterprises in pharmaceuticals, financial services, and healthcare, the industries where mis-hires are compliance events and where the cost of getting evaluation wrong is greatest.
We are building the first dedicated evaluation layer for the Human + Agent Workforce. It measures the four dimensions described above (orchestration, verification, decomposition, governance) in a live, sandboxed, regulator-grade environment, with deterministic scoring and an audit trail designed for GAMP 5, ISO 42001, the EU AI Act, and DPDP. We call it EX Fluency. It enters closed beta in the third quarter of 2026 with two design partners in pharmaceuticals and US enterprise hiring. General availability is the first quarter of 2027.
If you are running a regulated workforce and the question of who can wield agents productively, safely, and at scale has reached the executive team, or if it has not yet but you suspect it will, we would like to talk to you about being a design partner. The window in which this category is undefined will close, and the firms that shape the framework now will shape the standard later.
The broader point of this post is not Exterview. It is the question itself. Even if you never speak to us, ask the question inside your own organisation. Ask which of your employees can verify a hallucinated tool call. Ask which of your candidates have ever been tested under deliberate failure injection. Ask whether your evaluation framework would survive contact with a regulator. The answers will tell you more about your firm's readiness for the next decade than any other single diagnostic available to you in 2026.
The human + agent workforce is not coming. It is here. The only question is who measures it.
Manish Surapaneni is co-founder and CEO of Exterview. Exterview is an agentic talent intelligence platform built natively on the Microsoft AI Cloud. EX Fluency, the human + agent workforce evaluation layer, enters closed beta in Q3 2026.