Independent judgement for decision safety.
How decision safety is assessed.
This is an independent assessment of whether the product metrics and behavioural signals you already rely on can safely support the decisions being made.
The work is not about producing more reporting. It is about interpretation: what the numbers represent, what must be true for that interpretation to hold, and where confidence is being borrowed from unexamined assumptions.
This page describes how Product Metrics Assessment approaches judgement — what is examined, what tends to break over time, and why apparently “correct” reporting can still become decision-unsafe.
It is intentionally non-procedural: not a repeatable recipe, but a description of what gets tested so the judgement can be read and challenged from outside the system.
In stable periods, teams can operate on shared intuition. Under scrutiny, the same metrics must withstand direct questioning.
This is the moment a chart stops being “obvious” and starts needing a defensible explanation in plain language.
These are typical lines of challenge (not steps, and not exhaustive):
- What does this metric actually represent in product terms?
- What must be true for that interpretation to hold?
- What changed recently that could have altered meaning without changing the chart?
- If challenged, can we explain this without relying on insider knowledge?
The goal is not perfect truth. The goal is knowing where the signal is reliable enough to support a decision — and where it is not.
Metrics are rarely “just numbers.” They sit on top of a stack of implicit commitments that often go unspoken once the dashboard exists:
- eligibility: what behaviour is observable at all (e.g. consent boundaries, blockers, platform constraints)
- identity and continuity: when two actions are treated as “the same user” (and when they are not)
- session meaning: what “a session” implies about intent, and how that differs across surfaces
- capture coverage: which behaviours are measured directly vs inferred from proxies
- transforms and aggregation: what gets collapsed, bucketed, filtered, or imputed before it reaches a chart
- change history: schema/tooling/product changes that quietly redefine continuity over time
The assessment focuses on these dependencies because they are where meaning drifts — even when the data pipeline remains technically consistent.
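Many of these dependencies are never written down; they are encoded in how data is joined and counted. As a minimal sketch (hypothetical column names, illustrative rows), the example below shows how the identity and continuity assumption alone changes a headline figure while the pipeline itself stays technically consistent.

```python
# Minimal sketch: the same raw events support two different "active users"
# figures depending on how "the same user" is defined.
# Column names and rows are hypothetical, for illustration only.
import pandas as pd

events = pd.DataFrame({
    "device_id":  ["d1", "d1", "d2", "d3"],
    "account_id": ["a1", "a1", "a1", None],   # d3 never authenticated
    "event":      ["view", "purchase", "view", "view"],
})

# Keyed on device: every device counts as a separate user.
by_device = events["device_id"].nunique()                    # 3

# Keyed on account, falling back to device for anonymous traffic.
resolved = events["account_id"].fillna(events["device_id"])
by_identity = resolved.nunique()                             # 2

print(f"active users, keyed by device:   {by_device}")
print(f"active users, keyed by identity: {by_identity}")
```

Neither figure is wrong; they answer different questions. The risk is deciding as if they answered the same one.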
Confidence rarely collapses in a single incident. More often it erodes through familiar patterns:
- like-for-like assumptions: teams treat a renamed event, migrated tool, or redefined surface as continuous
- silent exclusions: consent changes or platform shifts reduce what is observable without a visible ‘break’
- identity fragmentation: the same person appears as multiple users across devices, domains, or auth states
- proxy hardening: a convenience metric becomes a decision metric long after its limits are forgotten
- interpretive drift: definitions remain stable in text while meaning changes in practice
- coverage debt: instrumentation gaps start to look like real user behaviour
These are interpretive risks more than implementation mistakes. The system continues to “work,” but the conclusions become less defensible.
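As one illustration of the first pattern, the sketch below (hypothetical event names, illustrative counts) shows how stitching a renamed event onto its predecessor produces a chart that looks continuous while quietly changing what is being counted.

```python
# Minimal sketch: an event is renamed mid-quarter, and the two names are
# summed as if they measured the same thing. Names and numbers are illustrative.
import pandas as pd

weekly = pd.DataFrame({
    "week":           [1, 2, 3, 4],
    "checkout_start": [900, 880, 0, 0],     # old event, retired after week 2
    "begin_checkout": [0, 0, 1400, 1350],   # new event, also fires on saved carts
})

# The stitched series looks continuous...
weekly["checkouts"] = weekly["checkout_start"] + weekly["begin_checkout"]
print(weekly[["week", "checkouts"]])

# ...but the apparent jump at week 3 is a definition change,
# not a change in user behaviour.
```

The chart is technically correct; the like-for-like reading of it is not.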
The assessment is anchored in the decisions being made, then works outward to the signals those decisions depend on. Evidence varies by context, but the examination tends to include:
- decision context: which decisions rely on which signals, and what ‘being wrong’ would cost
- metric meaning: what the team believes the metric represents, and where that belief came from
- definition stability: whether the same term means the same thing across teams and over time
- observability constraints: what behaviour is missing or selectively visible (and why)
- continuity assumptions: identity/session interpretations that create ‘the user journey’ on paper
- change exposure: where tooling, schema, product, or consent shifts could have changed meaning
- failure surfaces: the specific ways the signal could mislead under scrutiny
The intent is to make interpretive dependencies explicit — so decisions are not being made on top of unknown assumptions.
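One concrete way this shows up is in observability constraints: when a consent or platform change removes part of the audience from measurement, ratios can move even though behaviour has not. A minimal sketch with purely illustrative numbers:

```python
# Minimal sketch: a consent change shrinks the measurable audience, and the
# reported conversion rate "improves" without any change in the product.
# All numbers are illustrative.
measured_users = {"before": 10_000, "after": 7_000}  # opt-outs drop out of tracking
conversions    = {"before": 500,    "after": 430}    # converters consent at a higher rate

for period in ("before", "after"):
    rate = conversions[period] / measured_users[period]
    print(f"{period}: measured conversion rate = {rate:.1%}")

# before: 5.0%  ->  after: ~6.1%
```

The movement reflects who remains observable, not what users are doing.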
A useful methods description is not “how we do it.” It is an account of what the work resolves. Typical questions:
- Which signals are safe to rely on, and for which kinds of decisions?
- Where does the signal stop being strong enough to justify action?
- What assumptions are currently carrying decision weight without being named?
- Which disagreements are definitional vs interpretive?
- What failure modes are plausible in this environment — and how would we notice?
- If leadership asked ‘how do you know?’, what would be defensible to say?
The assessment artefact exists to answer these questions in a way that can be forwarded internally without translation.
Teams inside a system naturally normalise its definitions and edge cases. Over time, informal knowledge becomes part of “what the metric means,” even when it is no longer documented or widely shared.
Independent assessment is safer at decision points because it treats assumptions as testable claims, rather than inherited context. The goal is not to catch mistakes. The goal is to reduce decision risk by making meaning explicit and defensible.
This page describes how judgement is formed. It does not provide an execution method, implementation guide, or audit checklist.