Diosh Lequiron
Systems Thinking · 11 min read

What to Measure When You Can't Measure Everything

The things that matter most are hardest to measure; the things easiest to measure often matter less than they look. The Measurement Selection Protocol (leading vs. lagging, actionability, manipulation resistance, aggregate validity) provides a disciplined approach to choosing metrics that are genuinely informative.

The Measurement Problem Nobody Solves Cleanly

Every serious effort to understand a complex system eventually runs into the same wall: the things that matter most are the hardest to measure, and the things that are easiest to measure often matter less than they appear to.

This is not a technology problem. Better sensors, more data, more sophisticated analytics — these can reduce measurement cost and improve measurement precision. They do not resolve the underlying problem, which is conceptual: the outcomes most worth tracking (resilience, health, trust, learning, adaptive capacity) are inherently harder to operationalize than the outputs that proxy for them (incident count, yield per hectare, customer satisfaction scores, test passage rates).

Organizations that do not grapple honestly with this problem end up in one of two failure modes. The first is measuring what is easy and optimizing for it, which produces measurable improvements in indicators while the underlying reality those indicators were meant to capture deteriorates. The second is deciding the problem is too hard and not measuring anything systematically, which produces decisions based on impression, narrative, and the opinions of whoever speaks most confidently in the room.

Neither is acceptable. The alternative — measuring things that are genuinely informative, with explicit acknowledgment of what is being measured and what is not — requires a disciplined approach to measurement selection. The Measurement Selection Protocol described in this article is that approach.

What Easy Measurement Costs

Before describing the protocol, it is worth being specific about what easy measurement costs.

Metric optimization without performance improvement. When specific metrics are tied to organizational performance evaluation, the people being evaluated optimize those metrics. This is not dishonesty — it is rational behavior. The problem is that metric optimization and underlying performance improvement are different activities that frequently diverge.

Student test scores improve when teachers teach to the test. Customer satisfaction scores improve when customer service representatives are trained to generate positive responses at the moment of rating. Employee performance scores improve when performance management systems are gamed by experienced managers. Crop yield statistics improve when farmers shift to input-intensive approaches that damage long-term soil health.

In each case, the metric improves while the underlying reality the metric was meant to capture either stays constant or deteriorates. The measurement system has been successfully gamed — not through conspiracy, but through the rational responses of competent people to the incentive structure the measurement system created.

The Goodhart trap. This dynamic is often described as Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The law describes what happens when measurement systems ignore the gap between the indicator and the thing the indicator is meant to represent. Once that gap is large enough to be worth exploiting, exploitation follows.

The practical implication is not that measurement is futile. It is that measurement systems that ignore the Goodhart dynamic will be gamed, and that measurement selection needs to account for what happens to an indicator once it is used as a target.

Dashboard reality vs. ground reality. In organizations that depend heavily on dashboard metrics for their understanding of what is happening, a gap develops between what the dashboard shows and what is actually occurring. The dashboard shows what the metrics show. What the metrics show is shaped by how they are measured, by what gets reported, and by the rational responses of the people being measured. Ground reality is shaped by all of that plus the things the metrics do not capture.

The gap between dashboard reality and ground reality is widest in the domains where measurement is hardest — exactly the domains where accurate understanding matters most. This is the measurement problem in its most costly form.

The Measurement Selection Protocol

The Measurement Selection Protocol is a four-criterion framework for choosing metrics that are genuinely informative rather than merely measurable.

Criterion 1: Leading vs. Lagging

The first question is whether the proposed metric is a leading indicator or a lagging indicator.

A lagging indicator measures outcomes that have already occurred. Revenue, customer churn, crop yield, student graduation rates, organizational turnover — these are lagging indicators. They tell you what happened. They are important for understanding results, but they provide limited guidance for intervention because by the time they show a problem, the conditions that produced the problem have already had their effect.

A leading indicator measures conditions or behaviors that precede and predict the outcomes of interest. Soil organic matter content predicts long-term agricultural productivity. Employee engagement scores predict retention. Early learning indicators predict later academic performance. Customer product usage patterns predict renewal. Leading indicators are harder to identify — they require understanding of the causal structure that produces the outcomes — but they provide the early warning that makes intervention possible.

A robust measurement system includes both. Lagging indicators confirm whether the outcomes you care about are being achieved. Leading indicators allow you to detect deterioration early enough to respond before it shows up in outcomes.

The selection question is: for the performance domains you care about, what are the leading indicators? This requires explicit causal reasoning about what conditions and behaviors precede the outcomes you care about. It cannot be done by selecting metrics that are easy to collect.
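
To make the pairing concrete, here is a minimal sketch of a leading/lagging indicator registry with an early-warning check. The indicator names, thresholds, and readings are hypothetical illustrations, not a prescribed set.

```python
# A minimal sketch of pairing leading and lagging indicators.
# All indicator names, thresholds, and readings are hypothetical.

from dataclasses import dataclass

@dataclass
class Indicator:
    name: str
    kind: str          # "leading" or "lagging"
    warn_below: float  # level at which early intervention is considered

# One lagging outcome, paired with the leading indicators believed
# (through explicit causal reasoning) to precede it.
retention = Indicator("12-month retention rate", "lagging", warn_below=0.85)
leading_for_retention = [
    Indicator("engagement survey score", "leading", warn_below=3.5),
    Indicator("internal mobility rate", "leading", warn_below=0.05),
]

def early_warnings(readings: dict[str, float]) -> list[str]:
    """Flag leading indicators that have dropped below their warning level."""
    return [
        ind.name
        for ind in leading_for_retention
        if readings.get(ind.name, float("inf")) < ind.warn_below
    ]

# The lagging outcome still looks fine, but a leading indicator has slipped:
print(early_warnings({"engagement survey score": 3.1, "internal mobility rate": 0.07}))
# -> ['engagement survey score']
```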

Criterion 2: Actionability

The second question is whether the metric produces information that decision-makers can act on.

A metric is actionable if: it updates frequently enough to be relevant to decisions (not quarterly when decisions are made daily), it is specific enough to indicate what kind of response is appropriate, and it is at least substantially within the control of the people being asked to respond to it.

A metric is not actionable if it measures outcomes that decision-makers cannot affect (external market conditions, weather, macroeconomic factors), updates so infrequently that it is stale by the time it informs a decision, or aggregates across so many dimensions that it cannot indicate what specifically is driving the reading.

The actionability criterion eliminates a large category of metrics that feel important — overall revenue, brand sentiment, market share — as primary management metrics. These are important outcomes to track, but they are too aggregated and too slow-moving to guide the day-to-day and week-to-week decisions that determine whether those outcomes improve or deteriorate.

Actionability does not mean that a metric must be under complete control of the decision-maker. Agriculture operates in environments where rainfall, temperature, and market prices are outside farmer control. But a measurement system that is mostly composed of metrics outside farmer control is not a decision-support tool — it is a tracking system that tells farmers how their outcomes are going without helping them improve.
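
As a rough illustration, the actionability tests above can be encoded as a simple screening function. The metric profiles, fields, and cutoffs below are hypothetical and would need to be calibrated to the organization's actual decision cadence.

```python
# A rough screening function for the actionability criterion.
# The example metrics, fields, and the 0.5 control cutoff are hypothetical.

from dataclasses import dataclass

@dataclass
class MetricProfile:
    name: str
    update_interval_days: int    # how often a fresh reading is available
    decision_interval_days: int  # how often decisions that could use it are made
    is_specific: bool            # does a reading point to a kind of response?
    share_in_team_control: float # rough fraction of variation the team can affect

def is_actionable(m: MetricProfile) -> bool:
    fresh_enough = m.update_interval_days <= m.decision_interval_days
    controllable = m.share_in_team_control >= 0.5  # arbitrary illustrative cutoff
    return fresh_enough and m.is_specific and controllable

metrics = [
    MetricProfile("overall revenue", 90, 7, is_specific=False, share_in_team_control=0.2),
    MetricProfile("mid-season soil moisture", 7, 7, is_specific=True, share_in_team_control=0.6),
]
for m in metrics:
    print(m.name, "->", "actionable" if is_actionable(m) else "track only")
```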

Criterion 3: Manipulation Resistance

The third question is how much the metric can be improved through means other than improving the underlying performance it is meant to measure.

All metrics can be gamed. The question is the cost of gaming relative to the cost of genuine improvement. A manipulation-resistant metric is one where gaming requires more effort, creates more risk, or produces worse long-term outcomes than genuine improvement would.

Manipulation resistance is partly a function of how the metric is constructed. Composite metrics that aggregate across multiple components are harder to game than single-component metrics because improving the composite requires improving across all components simultaneously. Metrics that are verified through independent observation are harder to game than self-reported metrics. Metrics that are sampled randomly from a large pool are harder to game than metrics that are applied to a fixed, known population.
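
A sketch of two of these construction techniques — a composite score that cannot be moved without moving every component, and random sampling of records for independent verification. The component names and the sampling fraction are illustrative assumptions.

```python
# A sketch of two construction techniques: a composite across components,
# and random sampling for independent verification. Names are hypothetical.

import random
import statistics

def composite_score(components: dict[str, float]) -> float:
    """Gaming the composite requires moving every component, not just one."""
    return statistics.mean(components.values())

def audit_sample(record_ids: list[str], fraction: float = 0.1, seed=None) -> list[str]:
    """Randomly choose records for independent verification; because the
    sample is not known in advance, polishing a fixed subset does not help."""
    rng = random.Random(seed)
    k = max(1, int(len(record_ids) * fraction))
    return rng.sample(record_ids, k)

print(composite_score({"soil organic matter": 0.7, "water use efficiency": 0.5, "yield stability": 0.6}))
print(audit_sample([f"farm-{i:03d}" for i in range(200)], fraction=0.05, seed=42))
```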

But manipulation resistance is also a function of how the metric is used. Metrics tied directly to high-stakes rewards or punishments are under enormous optimization pressure and will be gamed more aggressively than metrics used for learning and improvement. This does not mean high-stakes metrics are wrong — it means that their design needs to anticipate the optimization pressure they will face.

A key design principle: the harder a metric is to game, the better it can serve as a target. The easier it is to game, the more it should be confined to use as a learning indicator (used by the people being measured to understand their own performance) rather than an accountability indicator (used by others to evaluate them).

Criterion 4: Aggregate Validity

The fourth question is whether measuring this thing at scale — across many instances, locations, or time periods — produces an aggregate that validly represents the underlying reality you care about.

A metric has aggregate validity if: the thing it measures is consistent across the contexts in which it is being measured (the same metric means the same thing in different schools, farms, or organizations), the aggregation method (average, sum, rate) is appropriate for the distribution of values, and the aggregate is interpretable — a change in the aggregate corresponds to a change in the underlying reality in the direction you expect.

Many commonly used metrics fail this criterion. Average customer satisfaction scores aggregate across interactions that differ enormously in type, context, and stakes. Average teacher effectiveness ratings aggregate across teachers working in very different student populations and resource contexts. Average crop yield statistics aggregate across farms with different soil, water, and market access conditions. The aggregate is computable; its interpretation as a valid representation of underlying performance is not straightforward.
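
A small worked example of the aggregation-method point: two hypothetical sets of satisfaction scores share the same mean but describe very different realities, which is why the mean alone is not automatically a valid aggregate.

```python
# Two hypothetical groups of satisfaction scores with the same mean but very
# different distributions. The mean alone cannot tell them apart.

import statistics

uniformly_ok = [3.5, 3.6, 3.4, 3.5, 3.5, 3.6, 3.4, 3.5]
polarized    = [5.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0]  # love-it-or-hate-it

for name, scores in [("uniformly ok", uniformly_ok), ("polarized", polarized)]:
    print(
        name,
        "mean =", round(statistics.mean(scores), 2),
        "median =", statistics.median(scores),
        "stdev =", round(statistics.stdev(scores), 2),
    )
# Both means are 3.5; the medians and spreads tell the real story.
```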

Aggregate validity matters most for measurement systems used at scale — national education metrics, agricultural development program indicators, organizational health benchmarks. At scale, the interpretation of the aggregate has real consequences. Improving a number that does not validly represent the underlying reality produces nothing except the improvement in the number.

Common Measurement Failures in Organizational and Agricultural Contexts

The four criteria identify specific failure patterns that appear consistently across organizational and agricultural measurement contexts.

The activity metric trap. Organizations frequently measure activities when they want to understand outcomes. Training hours instead of skill acquisition. Meeting frequency instead of decision quality. Field visits instead of farmer adoption. Activity metrics are easy to collect, hard to game in obvious ways, and entirely uninformative about whether the activities produced anything useful. The activity-to-outcome causal chain is exactly what needs to be established, not assumed.

The aggregation-without-stratification failure. Agricultural programs that measure average yield improvements without stratifying by farmer type, soil condition, or resource access frequently conclude that interventions that worked for well-resourced, well-situated farmers also worked for everyone. The aggregate improvement is real. Its interpretation as evidence that the intervention works across the full population is not. This is an aggregate validity failure with real consequences: programs are scaled based on evidence that does not support the scale.
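
A sketch of what stratification reveals, with invented numbers: the aggregate change looks positive, but splitting by farmer type shows the improvement concentrated in the well-resourced stratum while the resource-constrained stratum is flat or slightly worse.

```python
# Stratifying a yield comparison by farmer type. All figures are hypothetical.

baseline = {"well-resourced": [4.0, 4.2, 4.1], "resource-constrained": [2.0, 2.1, 1.9]}
endline  = {"well-resourced": [5.0, 5.2, 5.1], "resource-constrained": [1.9, 2.0, 1.8]}

def mean(xs):
    return sum(xs) / len(xs)

all_base = [y for ys in baseline.values() for y in ys]
all_end  = [y for ys in endline.values() for y in ys]
print("aggregate change:", round(mean(all_end) - mean(all_base), 2))  # looks positive

for stratum in baseline:
    change = mean(endline[stratum]) - mean(baseline[stratum])
    print(stratum, "change:", round(change, 2))
# The aggregate gain is driven entirely by the well-resourced stratum.
```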

The lagging-indicator-only measurement system. Organizations that measure only outcomes have no early warning system. They discover that performance is deteriorating when they see it in the numbers — at which point the conditions that produced the deterioration have been operating for months or years. Bayanihan Harvest's agricultural systems work consistently runs into this: communities that measure only final yield at harvest have no ability to detect mid-season stress until it has already affected the crop. The response is reactive rather than adaptive. Building in leading indicators — soil moisture, plant health indicators, pest pressure early warning — changes the decision timescale from harvest to season to week.
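
As a rough illustration of the timescale shift, the sketch below compares when the two measurement systems would first detect the same stress; the weekly soil-moisture series and warning threshold are invented for the example.

```python
# Detection latency under a lagging-only system vs. one with a leading
# indicator. The weekly series and threshold are invented: soil moisture
# (leading) starts declining mid-season; yield (lagging) is only observed
# at harvest in week 20.

soil_moisture = [28] * 8 + [24, 21, 18, 16, 15, 14, 13, 12, 12, 11, 11, 10]  # weeks 1-20
MOISTURE_WARN = 20

first_leading_signal = next(
    week for week, value in enumerate(soil_moisture, start=1) if value < MOISTURE_WARN
)
harvest_week = len(soil_moisture)

print(f"leading-indicator system flags stress in week {first_leading_signal}")
print(f"lagging-only system sees it at harvest, week {harvest_week}")
print(f"weeks of possible response lost: {harvest_week - first_leading_signal}")
```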

The metric-as-ceiling failure. When metrics are used primarily as thresholds (pass/fail, above/below target), they create an incentive to reach the threshold and stop improving. The metric that was designed to raise the floor becomes a ceiling. Organizations that hit their customer satisfaction target stop investing in customer experience improvements. Farms that achieve the target yield per hectare stop investing in practices that would produce further improvement. The selection criterion for avoiding this is using metrics that are positioned on a continuous scale where more is consistently better, rather than metrics where "passing" creates a natural stopping point.
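
A minimal illustration of the difference between the two readings of the same metric; the target value is hypothetical.

```python
# Pass/fail vs. continuous readings of the same metric. Under the threshold
# reading, any improvement beyond the target is invisible; on the continuous
# scale, further improvement still registers. The target is hypothetical.

TARGET_YIELD_T_PER_HA = 4.0

def threshold_reading(yield_t_per_ha: float) -> str:
    return "pass" if yield_t_per_ha >= TARGET_YIELD_T_PER_HA else "fail"

def continuous_reading(yield_t_per_ha: float) -> float:
    return yield_t_per_ha  # more is consistently better; no natural stopping point

for y in (3.8, 4.0, 4.6, 5.2):
    print(y, "->", threshold_reading(y), "|", continuous_reading(y))
```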

Measuring What Matters in Practice

The hardest part of applying the Measurement Selection Protocol is the first step: identifying what actually matters.

This is not a measurement question — it is a values and strategy question. What is the organization or system trying to achieve? What does success actually look like, at the level of the outcomes that matter, not the indicators that proxy for them? What would have to be true about the world in two years, five years, ten years for this work to have been worth doing?

These questions are harder to answer than "what can we collect data on?" They require clarity about purpose, theory of change, and the time horizons over which success is assessed. But they are the right starting point. Measurement systems built from this starting point measure things that are informative about what matters. Measurement systems built from "what can we collect data on?" measure things that are legible without necessarily being informative.

The HavenWizards experience in property technology provides a parallel: the most accessible metrics in property search — page views, search queries, listing clicks — are not the most informative metrics about whether the platform is serving buyers and renters well. The most informative metrics — time from search initiation to decision, quality of match between stated preferences and listings engaged, repeat-use rate by outcome type — require more work to define, collect, and interpret. Building measurement systems around the accessible metrics rather than the informative ones produces a dashboard that is easy to fill and hard to learn from.
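
As an illustration of what one of the more informative metrics might look like as a computation, here is a sketch that derives time from search initiation to decision from per-user event timestamps. The event schema and field names are invented for the example, not HavenWizards' actual data model.

```python
# A sketch: time from search initiation to decision, derived from per-user
# event timestamps. The event schema and field names are hypothetical.

from datetime import datetime

events = [
    {"user": "u1", "type": "search_started", "at": "2024-03-01T09:00:00"},
    {"user": "u1", "type": "decision_made",  "at": "2024-03-18T17:30:00"},
    {"user": "u2", "type": "search_started", "at": "2024-03-05T12:00:00"},
    # u2 has not decided yet, so no duration is produced for them
]

def search_to_decision_days(events: list[dict]) -> dict[str, float]:
    starts, decisions = {}, {}
    for e in events:
        ts = datetime.fromisoformat(e["at"])
        if e["type"] == "search_started":
            starts.setdefault(e["user"], ts)
        elif e["type"] == "decision_made":
            decisions[e["user"]] = ts
    return {
        user: (decisions[user] - start).total_seconds() / 86400
        for user, start in starts.items()
        if user in decisions
    }

print(search_to_decision_days(events))  # {'u1': 17.35...}
```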

The discipline is choosing informative over accessible, leading over lagging, actionable over comprehensive, and manipulation-resistant over precise. Each of these choices involves real cost. The alternative — measuring what is easy, optimizing what is measured, and calling the result performance management — has costs too. They are just less visible, paid over longer time horizons, by people other than the ones who built the measurement system.
