Independent judgement for decision safety.
How decision safety is assessed.
This is an independent assessment of whether the product metrics and behavioural signals you already rely on can safely support the decisions being made.
The work is not about producing more reporting. It is about interpretation: what the numbers represent, what must be true for that interpretation to hold, and where confidence is being borrowed from unexamined assumptions.
This page describes how Product Metrics Assessment approaches judgement — what is examined, what tends to break over time, and why apparently “correct” reporting can still become decision-unsafe.
It is intentionally non-procedural: not a repeatable recipe, but a description of what gets tested so the judgement can be read and challenged from outside the system.
In stable periods, teams can operate on shared intuition. Under scrutiny, the same metrics must withstand direct questioning.
This is the moment a chart stops being “obvious” and starts needing a defensible explanation in plain language.
These are typical lines of challenge (not steps, and not exhaustive):
- What does this metric actually represent in product terms?
- What must be true for that interpretation to hold?
- What changed recently that could have altered meaning without changing the chart?
- If challenged, can we explain this without relying on insider knowledge?
The goal is not perfect truth. The goal is knowing where the signal is reliable enough to support a decision — and where it is not.
Metrics are rarely “just numbers.” They sit on top of a stack of implicit commitments that often go unspoken once the dashboard exists:
- eligibility: what behaviour is observable at all (e.g. consent boundaries, blockers, platform constraints)
- identity and continuity: when two actions are treated as “the same user” (and when they are not)
- session meaning: what “a session” implies about intent, and how that differs across surfaces
- capture coverage: which behaviours are measured directly vs inferred from proxies
- transforms and aggregation: what gets collapsed, bucketed, filtered, or imputed before it reaches a chart
- change history: schema/tooling/product changes that quietly redefine continuity over time
The assessment focuses on these dependencies because they are where meaning drifts — even when the data pipeline remains technically consistent.
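Many of these dependencies are never written down; they are encoded in how data is joined and counted. As a minimal sketch (hypothetical column names, illustrative rows), the example below shows how the identity and continuity assumption alone changes a headline figure while the pipeline itself stays technically consistent.

```python
# Minimal sketch: the same raw events support two different "active users"
# figures depending on how "the same user" is defined.
# Column names and rows are hypothetical, for illustration only.
import pandas as pd

events = pd.DataFrame({
    "device_id":  ["d1", "d1", "d2", "d3"],
    "account_id": ["a1", "a1", "a1", None],   # d3 never authenticated
    "event":      ["view", "purchase", "view", "view"],
})

# Keyed on device: every device counts as a separate user.
by_device = events["device_id"].nunique()                    # 3

# Keyed on account, falling back to device for anonymous traffic.
resolved = events["account_id"].fillna(events["device_id"])
by_identity = resolved.nunique()                             # 2

print(f"active users, keyed by device:   {by_device}")
print(f"active users, keyed by identity: {by_identity}")
```

Neither figure is wrong; they answer different questions. The risk is deciding as if they answered the same one.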
Confidence rarely collapses in a single incident. More often it erodes through familiar patterns:
- like-for-like assumptions: teams treat a renamed event, migrated tool, or redefined surface as continuous
- silent exclusions: consent changes or platform shifts reduce what is observable without a visible ‘break’
- identity fragmentation: the same person appears as multiple users across devices, domains, or auth states
- proxy hardening: a convenience metric becomes a decision metric long after its limits are forgotten
- interpretive drift: definitions remain stable in text while meaning changes in practice
- coverage debt: instrumentation gaps start to look like real user behaviour
These are interpretive risks more than implementation mistakes. The system continues to “work,” but the conclusions become less defensible.
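As one illustration of the first pattern, the sketch below (hypothetical event names, illustrative counts) shows how stitching a renamed event onto its predecessor produces a chart that looks continuous while quietly changing what is being counted.

```python
# Minimal sketch: an event is renamed mid-quarter, and the two names are
# summed as if they measured the same thing. Names and numbers are illustrative.
import pandas as pd

weekly = pd.DataFrame({
    "week":           [1, 2, 3, 4],
    "checkout_start": [900, 880, 0, 0],     # old event, retired after week 2
    "begin_checkout": [0, 0, 1400, 1350],   # new event, also fires on saved carts
})

# The stitched series looks continuous...
weekly["checkouts"] = weekly["checkout_start"] + weekly["begin_checkout"]
print(weekly[["week", "checkouts"]])

# ...but the apparent jump at week 3 is a definition change,
# not a change in user behaviour.
```

The chart is technically correct; the like-for-like reading of it is not.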
The assessment is anchored in the decisions being made, then works outward to the signals those decisions depend on. Evidence varies by context, but the examination tends to include:
- decision context: which decisions rely on which signals, and what ‘being wrong’ would cost
- metric meaning: what the team believes the metric represents, and where that belief came from
- definition stability: whether the same term means the same thing across teams and over time
- observability constraints: what behaviour is missing or selectively visible (and why)
- continuity assumptions: identity/session interpretations that create ‘the user journey’ on paper
- change exposure: where tooling, schema, product, or consent shifts could have changed meaning
- failure surfaces: the specific ways the signal could mislead under scrutiny
The intent is to make interpretive dependencies explicit — so decisions are not being made on top of unknown assumptions.
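One concrete way this shows up is in observability constraints: when a consent or platform change removes part of the audience from measurement, ratios can move even though behaviour has not. A minimal sketch with purely illustrative numbers:

```python
# Minimal sketch: a consent change shrinks the measurable audience, and the
# reported conversion rate "improves" without any change in the product.
# All numbers are illustrative.
measured_users = {"before": 10_000, "after": 7_000}  # opt-outs drop out of tracking
conversions    = {"before": 500,    "after": 430}    # converters consent at a higher rate

for period in ("before", "after"):
    rate = conversions[period] / measured_users[period]
    print(f"{period}: measured conversion rate = {rate:.1%}")

# before: 5.0%  ->  after: ~6.1%
```

The movement reflects who remains observable, not what users are doing.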
A useful methods description is not “how we do it.” It is an account of what the work resolves. Typical questions:
- Which signals are safe to rely on, and for which kinds of decisions?
- Where does the signal stop being strong enough to justify action?
- What assumptions are currently carrying decision weight without being named?
- Which disagreements are definitional vs interpretive?
- What failure modes are plausible in this environment — and how would we notice?
- If leadership asked ‘how do you know?’, what would be defensible to say?
The assessment artefact exists to answer these questions in a way that can be forwarded internally without translation.
Teams inside a system naturally normalise its definitions and edge cases. Over time, informal knowledge becomes part of “what the metric means,” even when it is no longer documented or widely shared.
Independent assessment is safer at decision points because it treats assumptions as testable claims, rather than inherited context. The goal is not to catch mistakes. The goal is to reduce decision risk by making meaning explicit and defensible.
This page describes how judgement is formed. It does not provide an execution method, implementation guide, or audit checklist.