Concept

Data sourcing & validation

cross

Overview

Data sourcing and validation is the build-time discipline of giving every number in a model a recorded origin and a validation state — finding a value, noting where it came from and on what basis, rating it against the source hierarchy, and flagging it as unconfirmed until a stronger source or an expert checks it. It is the practice that produces provenance during the build, before any outside reader ever asks.

Body

The loop, per input. Each number passes through four moves: source it (find a value and record its origin and basis), rate it against the source hierarchy and the accuracy class it implies, flag it as unvalidated when it rests on a weak or assumed source, and clear the flag when a better source or an expert confirms it. The running list of open flags is a validation queue — the model’s own record of what it has not yet confirmed.

Sourcing records origin and basis. A value without its basis — the cost year, the scale, the system boundary it was measured on — is only half-sourced: it is attributed but not transferable to the case being modeled. The record has to carry both, or the figure cannot be safely reused.

Validation is distinct from sourcing. Sourcing records where a number came from; validation confirms it is adequate against a stronger or independent source. The two come apart: a number can be fully sourced (traceable) yet unvalidated (its source not yet judged good enough), and that gap is exactly what the flag marks.

The flag is information, not an embarrassment. Marking a figure unvalidated states plainly that the result currently leans on an unconfirmed input, and it localizes where confirmation effort would actually change how much the answer can be trusted. An unflagged model is not more finished — it has merely hidden which of its numbers are still guesses.

The state of the result tracks the state of its drivers. Because the answer rides on a few drivers, the validation state of those inputs sets the validation state of the conclusion; a fully-sourced model with one unvalidated driver has an unvalidated result.

Limits & typical error

Sourcing is not validation. Recording an origin does not confirm the value; a fully-attributed number can still be an unchecked guess. Attribution makes an error findable, not absent.
Sourcing the value but not the condition leaves it non-transferable. A credit rate captured without its eligibility rules, or a cost without its cost-year and scale, is attributed and still cannot be carried to the modeled case — the basis is part of what must be sourced.
An unflagged assumption masquerades as data. Failing to mark a stated anchor as an assumption lets a modeling choice read as a sourced measurement, overstating how grounded the model is.
Volume of sourcing is not coverage. Sourcing many minor inputs while a driver stays unvalidated leaves the result undefended; coverage is measured on the load-bearing few, not the count of citations.
The flag and the text drift apart. Clearing a value in the model but leaving its flag set — or the reverse — desyncs the recorded validation state from the real one, so the queue stops describing the model.
“Confirmation” against a circular or same-tier source is not validation. Checking a number against a source that traces back to the model’s own estimate, or against another value at the same weak tier, produces the appearance of validation without the substance.

Mini-example

Building the green-ammonia model, three load-bearing inputs get different treatment. The electricity intensity (~~10 MWh/t NH₃) is sourced to electrolyzer datasheets and rated a strong tier — close to validated. The electricity price (~~$40/MWh) is recorded as a stated market anchor and flagged unvalidated, pending a real market or PPA source to confirm the level and its range. The total capex (~$1.0bn) is flagged pending a vendor quote. Read together, the open flags say the ~$800/t headline currently rests on the price and the capex anchors — pointing confirmation effort exactly where it would change the result’s trust, rather than at the already-solid intensity.

Separately, to show sourcing without validation: marking the ~$540/t clean-hydrogen credit “sourced” to the policy’s headline rate, while leaving its eligibility conditions (the carbon-intensity tier, hourly-matched clean power) unflagged, records the value but not the condition the plant must meet — a number that is sourced yet not validated for whether it can actually be earned.