85% of orgs say their biggest AI challenge is… measuring impact

The Illusion of Activity Versus the Reality of Impact

You have deployed a customer service chatbot. The dashboard shows impressive numbers: 10,000 conversations handled, an average resolution time of 45 seconds, and a 95% user satisfaction score from the post-interaction survey. The project is hailed as a success. Yet, six months later, the volume of escalations to human agents hasn't budged, and customer complaints about being trapped in unhelpful conversational loops are rising. The metrics you tracked were precise, meticulously gathered, and utterly misleading. This scenario is not an exception; it is the rule. A recent survey by a major consultancy found that 85% of organisations cite measuring the business impact of AI as their single greatest challenge. The problem is not a lack of data, but a profound misalignment between what we can easily measure and what actually matters. We conflate operational efficiency with strategic value, tracking the speed of the engine while ignoring the direction of the train.

This failure in measurement is fundamentally a failure in applied leadership and strategic decision-making. Leaders, under pressure to demonstrate ROI on substantial AI investments, often reach for the most accessible quantitative proxies: uptime, latency, transaction volume, or shallow engagement metrics. These are output metrics, not outcome metrics. They tell you the system is running, not that it is creating value. The applied leader must shift the conversation from "Is the AI working technically?" to "Is the AI driving the business outcome we need?" This requires a ruthless focus on linking AI performance to pre-existing business key performance indicators (KPIs), whether that's customer lifetime value, net promoter score, production yield, or employee retention. The challenge is that this linkage is messy, multivariate, and often requires controlled experimentation to isolate the AI's true effect—a step most organisations, satisfied with vanity metrics, never take.

Why Vanity Metrics Are The Default (And How They Deceive)

Vanity metrics are seductive because they are easy to collect, easy to visualise, and almost always move in the right direction post-implementation. Consider a new AI-powered tool for sales teams that analyses call transcripts and suggests next steps. It's simple to track adoption (logins per user), activity (features used per session), and user ratings. These numbers will likely look good initially, driven by novelty and mandate. However, they are dangerously disconnected from the ultimate goal: increasing sales. A team could be highly "adopted" and "active" while using the tool to justify poor sales practices or, worse, wasting time on low-probability leads the AI mis-prioritised. The vanity metrics create an illusion of progress, insulating the project from critical scrutiny because the dashboard is green. This is where leadership judgement must override the comforting glow of analytics screens.

The deception deepens because these metrics often become the de facto target for the data science and engineering teams building the AI. Incentivised to improve system accuracy (e.g., making better next-step suggestions), they optimise their models against historical data. This can lead to a phenomenon known as "Goodhart's Law": when a measure becomes a target, it ceases to be a good measure. The AI gets better at predicting what a salesperson *did* in historical successful calls, not necessarily what they *should do* to create future success. It reinforces past patterns, potentially missing novel strategies or failing to adapt to a changing market. The applied leader must ensure the team's success metrics are explicitly and causally linked to business outcomes, even if those metrics are harder to measure. This might mean running a pilot where half the sales team uses the AI and the other half operates as a control group, comparing not activity, but final conversion rates and deal sizes over a full quarter.

The Proxy Trap in People Analytics

A stark example of the vanity metric trap is in AI-driven HR and "people analytics." Tools promise to predict employee attrition, often using proxies like decreased calendar activity, reduced logins to internal systems, or changes in communication tone. A leader might be alerted that an employee has an "85% flight risk" score. Acting on this alone is perilous. The metric is a proxy for a human reality—disengagement—but it is not the reality itself. Perhaps the employee is working deeply on a complex project offline, is on a legitimate quiet holiday, or is dealing with a personal matter. Basing a retention conversation solely on the AI's score can damage trust and accelerate the very departure you hoped to prevent. The measurement here must be a trigger for human conversation, not a replacement for it. The impact metric is not the accuracy of the prediction, but the subsequent reduction in unwanted attrition after thoughtful, manager-led interventions.

Designing Impact Metrics That Drive Real Decisions

Escaping the vanity metric trap requires a disciplined, backward-designed approach to measurement. Before a single line of model code is written, the applied leader must force an answer to the question: "What business decision will this AI inform or automate, and how will we know if that decision is better?" This shifts the focus from the AI as a "project" to the AI as a "decision-making agent." For instance, an AI for predictive maintenance shouldn't be measured on its fault prediction accuracy alone. The true impact metric is the reduction in unplanned downtime hours and the associated operational cost savings, net of the new costs of preventative maintenance actions the AI triggers. This requires establishing a baseline (downtime hours before AI) and comparing it to the period after, while controlling for other factors like production volume.
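
To make the netting concrete, here is a minimal sketch of the arithmetic with purely illustrative figures: the downtime hours, cost rates, and intervention counts are assumptions, not data from any real deployment, and the adjustment for production volume is deliberately omitted for brevity.

```python
# Illustrative net-impact calculation for a predictive maintenance AI.
# All figures below are hypothetical placeholders, not real data.

baseline_downtime_hours = 420.0   # unplanned downtime in the comparable pre-AI period
post_ai_downtime_hours = 310.0    # unplanned downtime after deployment
cost_per_downtime_hour = 8_000.0  # assumed operational cost of one hour of unplanned downtime

preventative_actions = 35         # AI-triggered maintenance interventions in the period
cost_per_action = 1_500.0         # assumed average cost of one preventative intervention

gross_savings = (baseline_downtime_hours - post_ai_downtime_hours) * cost_per_downtime_hour
preventative_cost = preventative_actions * cost_per_action
net_impact = gross_savings - preventative_cost

print(f"Gross downtime savings: £{gross_savings:,.0f}")
print(f"Cost of AI-triggered maintenance: £{preventative_cost:,.0f}")
print(f"Net impact: £{net_impact:,.0f}")
```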

This process is inherently cross-functional. The data science team understands the model's confidence intervals; the operations team understands the cost of downtime and planned maintenance; finance understands how to net the costs. The leader's role is to synthesise these perspectives into a single, agreed-upon impact framework. This framework should include leading indicators (e.g., "percentage of alerts acted upon within 24 hours") and lagging outcome metrics (e.g., "mean time between failures"). Critically, it must also measure potential negative outcomes. Did the AI cause an increase in unnecessary maintenance ("false positives") that wasted resources? Measuring impact holistically means accounting for trade-offs, a core tenet of sound decision-making under uncertainty. The metric suite becomes a balanced scorecard that reflects the true, multifaceted effect of deploying intelligence into a complex system.
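
One lightweight way to keep such a scorecard honest is to encode each metric with its direction and target up front, so that "green" is defined before the results arrive. The sketch below is hypothetical; the metric names, targets, and observed values are placeholders for illustration.

```python
# A minimal, hypothetical impact scorecard for the predictive maintenance example.
# Metric names, targets, and observed values are placeholders, not real data.

scorecard = [
    # (metric, kind, higher_is_better, target, observed)
    ("alerts_acted_on_within_24h_pct", "leading", True, 80.0, 72.5),
    ("mean_time_between_failures_hours", "lagging", True, 900.0, 940.0),
    ("false_positive_maintenance_actions", "guardrail", False, 10, 14),
]

for metric, kind, higher_is_better, target, observed in scorecard:
    on_track = observed >= target if higher_is_better else observed <= target
    status = "on track" if on_track else "off track"
    print(f"[{kind:9}] {metric}: observed={observed} target={target} -> {status}")
```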

The Crucial Role of Counterfactuals and Controlled Experiments

To truly measure impact, you must answer a question that is inherently unobservable: what would have happened *without* the AI? This is the counterfactual. The most robust way to establish it is through a randomised controlled trial (RCT), the gold standard in both science and increasingly in business. For example, if you deploy an AI tool to help underwriters assess risk, randomly assign half your underwriters to use the tool (the treatment group) and half to proceed as normal (the control group). After a sufficient period, compare the outcomes: not just processing speed, but the actual loss ratios and profitability of the policies written by each group. The difference, statistically adjusted, is the causal impact of the AI.
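
A practical first step before launching such a trial is a power calculation: how many underwriters or policies does each arm need before a difference of the size you care about becomes detectable? The sketch below uses statsmodels; the effect size (Cohen's d of 0.2, a "small" effect) and the 80% power target are illustrative assumptions to be replaced with your own minimum effect of interest.

```python
# Rough sample-size estimate for a two-arm trial, using statsmodels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the number of observations per group needed to detect a
# small effect (d = 0.2) at the 5% significance level with 80% power.
n_per_group = analysis.solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"Observations needed per group: {n_per_group:.0f}")
```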

Many leaders balk at this, citing lost opportunity or operational complexity. However, the cost of *not* running a controlled experiment is far higher: the cost of scaling a system based on faith in correlated vanity metrics. You may invest millions in an enterprise-wide rollout of a tool that has zero or even negative net impact. An RCT provides definitive evidence for decision-making. It moves the conversation from "the dashboard looks good" to "we are 95% confident this AI increases profitability by 2-5%." This level of clarity is transformative. When full RCTs aren't feasible, quasi-experimental methods, like difference-in-differences analysis comparing similar business units that adopted at different times, can provide strong evidence. The principle remains: leadership must demand evidence of causality, not just correlation, before committing to strategic scale.
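
For the quasi-experimental route, the canonical two-period difference-in-differences model is a regression with an interaction between "treated" and "post"; the coefficient on that interaction is the causal estimate. The sketch below simulates data with a known effect and recovers it; the variable names and effect size are invented for illustration.

```python
# Minimal difference-in-differences sketch with simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

treated = rng.integers(0, 2, n)   # 1 = business unit that adopted the AI
post = rng.integers(0, 2, n)      # 1 = observation after the adoption date
true_effect = 3.0                 # assumed causal effect, for simulation only
outcome = (
    50.0                          # baseline level
    + 2.0 * treated               # pre-existing difference between units
    + 1.5 * post                  # common time trend affecting everyone
    + true_effect * treated * post  # the causal effect we want to recover
    + rng.normal(0, 5.0, n)       # noise
)

df = pd.DataFrame({"outcome": outcome, "treated": treated, "post": post})
model = smf.ols("outcome ~ treated * post", data=df).fit()
# The DiD estimate is the interaction coefficient; it should be close to 3.0.
print(f"DiD estimate: {model.params['treated:post']:.2f}")
```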

Implementing an A/B Test for a Marketing AI

Imagine an AI that personalises website promotions in real-time. The easy metric is click-through rate (CTR). But the business impact is revenue. A robust approach is to run an A/B test: 50% of visitors see promotions chosen by the new AI (Variant A), and 50% see promotions from the old rule-based system (Variant B). You must run the test long enough to capture not just clicks, but downstream conversions and average order value. Analysing this requires basic but crucial data science. Using Python, you wouldn't just compare mean revenue; you'd perform a statistical test (e.g., a t-test) to see if the difference is significant.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # fix the seed so the simulated example is reproducible

# Simulated revenue per user (in £) for each arm of the A/B test
revenue_control = np.random.normal(loc=45.0, scale=15.0, size=5000)    # Variant B: old rule-based system
revenue_treatment = np.random.normal(loc=48.5, scale=15.0, size=5000)  # Variant A: new AI system

# Welch's t-test (equal_var=False) is robust to unequal variances between arms
t_stat, p_value = stats.ttest_ind(revenue_treatment, revenue_control, equal_var=False)

mean_control, mean_treatment = revenue_control.mean(), revenue_treatment.mean()
lift = ((mean_treatment - mean_control) / mean_control) * 100

print(f"Control Mean Revenue: £{mean_control:.2f}")
print(f"Treatment Mean Revenue: £{mean_treatment:.2f}")
print(f"Observed Lift: {lift:.2f}%")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result is statistically significant. AI likely has a real impact.")
else:
    print("Result is not statistically significant. Observed lift could be due to chance.")
```

This simple analysis moves the discussion from "The AI is engaging" to "We are confident the AI drives a significant revenue lift." This is the language of business impact.

From Measurement to Governance: Building an Impact-First Culture

Ultimately, solving the measurement challenge is not a one-time project but a cultural shift towards impact-first AI governance. This requires applied leadership to institute new processes. Every proposal for an AI/ML initiative should begin with an "Impact Hypothesis" document. This document must state the primary business outcome, the causal mechanism, the key counterfactual metric, and the plan for measurement (e.g., pilot design, data collection plan). Review gates for continued funding should be based on evidence of progress against these impact metrics, not technical milestones alone. This creates accountability that aligns engineers, data scientists, and business stakeholders from the outset.
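
One way to give the Impact Hypothesis document teeth is to make its required fields explicit. The sketch below is a hypothetical template expressed as a Python dataclass; the field names mirror the elements listed above, and the example values are invented.

```python
# A hypothetical "Impact Hypothesis" template as a Python dataclass.
# Field names follow the elements described above; adapt to your own process.
from dataclasses import dataclass, field

@dataclass
class ImpactHypothesis:
    initiative: str
    primary_business_outcome: str   # the top-line or bottom-line KPI at stake
    causal_mechanism: str           # how the AI is expected to move that KPI
    counterfactual_metric: str      # what you will compare against (control group, baseline)
    measurement_plan: str           # pilot design and data collection approach
    guardrail_metrics: list[str] = field(default_factory=list)  # potential negative outcomes

proposal = ImpactHypothesis(
    initiative="Predictive maintenance rollout (pilot)",
    primary_business_outcome="Reduce unplanned downtime hours by 20% within two quarters",
    causal_mechanism="Earlier fault detection allows planned, cheaper interventions",
    counterfactual_metric="Downtime in matched production lines without the AI",
    measurement_plan="Staggered rollout; difference-in-differences across lines",
    guardrail_metrics=["false-positive maintenance actions", "maintenance labour cost"],
)
print(proposal.primary_business_outcome)
```

Review gates then become straightforward: funding continues only if the observed movement in the declared outcome metric, against the declared counterfactual, supports the hypothesis.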

Furthermore, leaders must champion intellectual honesty in reporting. Create forums where teams are safe to discuss why metrics are flat or falling, to analyse false positives and negative side effects. This is where the real learning happens. A culture that punishes "bad" results will simply incentivise the selection of rosy vanity metrics. Instead, reward rigorous measurement and learning, even if it kills a pet project. This builds organisational muscle memory for distinguishing between activity and achievement. Over time, this disciplined approach to measurement becomes your greatest strategic advantage, ensuring that your AI investments are deliberate engines of value, not just expensive sources of distracting data.

Actionable Takeaways for the Applied Leader

The statistic that 85% of organisations struggle to measure AI impact is a symptom of a deeper issue: the separation of technical execution from business accountability. Bridging this gap is the core work of modern applied leadership. You cannot outsource the understanding of value to the data science team; you must own the framework that defines it. Start by auditing your current AI initiatives. For each one, ask: "What is the primary business outcome?" If the answer is a technical or activity metric, you have identified a risk. Work with your teams to trace a logical, causal path from the AI's function to a top-line or bottom-line result.

Institutionalise rigour by mandating that new projects define their measurement strategy, including a plan for establishing a counterfactual, before receiving significant funding. Encourage, and resource, controlled experiments. Make "What was the effect?" a more important question than "How does it work?" Finally, shift your own review rhythms. Spend less time on dashboards of system health and more time on analyses that compare business performance between AI-assisted and non-assisted processes. This focus forces alignment across the organisation, from the data scientist tuning a model to the frontline manager using its output. By demanding evidence of causal impact, you move beyond the hype cycle and build a sustainable, value-driven AI capability that genuinely enhances decision-making and competitive advantage. The measure of your success in AI will not be the sophistication of your algorithms, but the tangible improvements they drive in the metrics that have always mattered to your business.