Validation Architecture

Five validation paths, one Power Event Record

Spark-XC sits above existing GPU, workload, facility, grid, and finance systems to validate, authorize, and prove AI power actions. Every governed action is checked across five independent validation paths and committed as a Power Event Record. Mission Control executes. Spark-XC validates.


Stack Position

Where Spark-XC sits in your AI infrastructure

AI infrastructure already runs vendor stacks that execute power actions — NVIDIA Mission Control, DCGM, schedulers, DCIM, BMS, and facility controls. Spark-XC does not compete with them.

It sits above those systems as a governance layer: validating, authorizing, and proving each power action across GPU, workload, facility, grid, and finance before and after it reaches hardware — then committing a Power Event Record.

Key architectural property

Governance is layered above the vendor stack: every action is validated across five independent paths and proven with a Power Event Record — without replacing Mission Control, DCGM, or your schedulers.

SPARK-XC Governance Layer

1 · GPU Telemetry Validation

TELEMETRY

2 · Workload / Scheduler Context

WORKLOAD

3 · Facility Power Correlation

FACILITY

4 · Policy / Approval Gates

POLICY

5 · Evidence Chain (PER)

PROVE

GPU & Workload Control

EXECUTE

Mission Control · DCGM · Slurm · K8s · Run:ai

VENDOR

Facility & Grid — DCIM · BMS · PDU/UPS · Utility

FACILITY

GPU Hardware

PCIe x16

Governance Posture

Spark-XC sits above — and stamps every action

Governance is layered above the execution path. As each action passes the Govern layer, Spark-XC stamps it with a Power Event Record. Because Spark-XC is not in the inline control path, the vendor stack keeps running even if governance is offline.

Govern (Spark-XC): validate · authorize · prove

Execute (Mission Control): DCGM · scheduler

Facility: DCIM · BMS · PDU

Hardware: GPU

ACTION · PER

Spark-XC sits above execution: it stamps each action with a Power Event Record without sitting in the control path.

Governance offline: execution still flows Execute → Hardware, but no PER is produced for those actions. Because Spark-XC is not in the control path, an outage does not halt execution, and the gap is itself recorded once governance returns.

Validation Paths

Each validation path, explained

PATH 01

GPU Telemetry Validation

Captures NVML/DCGM pre- and post-action snapshots — power, clocks, utilization, and temperature — to confirm what the GPU actually did against what was requested.

NVML / DCGM
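
The pre/post comparison above can be sketched roughly as follows. This assumes the snapshots have already been read from NVML or DCGM into plain records; the `Snapshot` fields, the `validate_power_action` helper, and the 5 W tolerance are hypothetical names and values chosen for illustration, not Spark-XC's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One telemetry sample. Field names are illustrative, not an NVML/DCGM schema."""
    power_limit_w: float    # enforced power limit read back from the GPU
    power_draw_w: float     # measured board power
    sm_clock_mhz: int
    utilization_pct: int
    temperature_c: int


def validate_power_action(requested_w: float, pre: Snapshot, post: Snapshot,
                          tolerance_w: float = 5.0) -> dict:
    """Compare the requested limit with what the post-action snapshot reports."""
    delta_w = post.power_limit_w - requested_w
    return {
        "requested_w": requested_w,
        "pre_limit_w": pre.power_limit_w,
        "readback_w": post.power_limit_w,
        "delta_w": delta_w,
        # True when the read-back limit matches the request within tolerance.
        "limit_applied": abs(delta_w) <= tolerance_w,
        # True when measured draw sits at or below the new limit (plus tolerance).
        "draw_within_limit": post.power_draw_w <= post.power_limit_w + tolerance_w,
    }


# Hypothetical sample values for a 350 W -> 300 W power-limit change.
pre = Snapshot(350.0, 341.2, 1755, 98, 71)
post = Snapshot(300.0, 296.8, 1590, 97, 68)
result = validate_power_action(300.0, pre, post)
not_applied = validate_power_action(300.0, pre, pre)  # GPU never changed
```

Here `result` describes an action that took effect, while `not_applied` shows the case where the read-back limit still sits at the old value.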

PATH 02

Workload / Scheduler Context

Pulls job and throughput context from Slurm, Kubernetes, and Run:ai, so every power action is validated against the workload it affects — not just the raw register.

Slurm · K8s · Run:ai

PATH 03

Facility Power Correlation

Correlates GPU-side power with DCIM, BMS, PDU/UPS, and utility signals where available — tying a GPU-level action to its rack, room, and grid impact.

DCIM · BMS · Grid
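
One way to picture the correlation step is a simple reconciliation between summed GPU draw and a rack-level PDU reading. The sketch below is an assumption-laden illustration: the `correlate_rack_power` helper, the 15% non-GPU overhead, and the 10% tolerance are hypothetical, and a real deployment would calibrate them per rack.

```python
from typing import Optional, Sequence


def correlate_rack_power(gpu_draw_w: Sequence[float],
                         pdu_reading_w: Optional[float],
                         overhead_fraction: float = 0.15,
                         tolerance_fraction: float = 0.10) -> dict:
    """Reconcile summed GPU draw against a PDU reading, if one is available."""
    gpu_total_w = float(sum(gpu_draw_w))
    if pdu_reading_w is None:
        # No facility signal: report the gap explicitly instead of assuming a match.
        return {"status": "NO_FACILITY_SIGNAL", "gpu_total_w": gpu_total_w}
    # Assumed model: rack draw = GPU draw plus a fixed fraction for CPUs, fans, etc.
    expected_w = gpu_total_w * (1.0 + overhead_fraction)
    residual_w = pdu_reading_w - expected_w
    within = abs(residual_w) <= tolerance_fraction * expected_w
    return {
        "status": "CORRELATED" if within else "MISMATCH",
        "gpu_total_w": gpu_total_w,
        "expected_pdu_w": expected_w,
        "pdu_reading_w": pdu_reading_w,
        "residual_w": residual_w,
    }


gpu_draw = [300.0] * 8                      # eight GPUs at 300 W each (illustrative)
matched = correlate_rack_power(gpu_draw, 2800.0)
missing = correlate_rack_power(gpu_draw, None)
mismatch = correlate_rack_power(gpu_draw, 4000.0)
```

Returning an explicit `NO_FACILITY_SIGNAL` status mirrors the "where available" wording: the absence of facility data narrows what can be shown rather than counting as agreement.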

PATH 04

Policy / Approval Gates

Authority, rate, scope, and oscillation gates evaluate every action and authorize, modify, defer, or block it before it reaches hardware. Gates are enterprise-configurable.

Authorize · Block
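
A minimal sketch of how four such gates might be chained is shown below. Everything in it is an assumption made for illustration: the `Action` and `Policy` shapes, the gate ordering, the reversal-counting rule for oscillation, and the final clamp that produces a "modify" outcome are not taken from Spark-XC.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence, Tuple


class Decision(Enum):
    AUTHORIZE = "authorize"
    MODIFY = "modify"
    DEFER = "defer"
    BLOCK = "block"


@dataclass(frozen=True)
class Action:
    actor: str
    gpu_ids: Tuple[str, ...]
    requested_w: float
    timestamp_s: float


@dataclass(frozen=True)
class Policy:
    allowed_actors: frozenset
    allowed_gpus: frozenset
    min_w: float
    max_w: float
    min_interval_s: float   # rate gate: minimum spacing between actions
    max_reversals: int      # oscillation gate: direction flips tolerated


def evaluate(action: Action, policy: Policy, history_w: Sequence[float],
             last_action_s: Optional[float]):
    """Return (decision, enforced_w, gate). history_w is oldest first."""
    if action.actor not in policy.allowed_actors:
        return Decision.BLOCK, None, "authority"
    if not set(action.gpu_ids) <= policy.allowed_gpus:
        return Decision.BLOCK, None, "scope"
    if last_action_s is not None and \
            action.timestamp_s - last_action_s < policy.min_interval_s:
        return Decision.DEFER, None, "rate"
    # Count up/down direction flips across recent limits plus this request.
    series = list(history_w) + [action.requested_w]
    deltas = [b - a for a, b in zip(series, series[1:]) if b != a]
    reversals = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if (d1 > 0) != (d2 > 0))
    if reversals > policy.max_reversals:
        return Decision.DEFER, None, "oscillation"
    # Assumed behaviour: out-of-envelope requests are clamped rather than rejected.
    clamped = min(max(action.requested_w, policy.min_w), policy.max_w)
    if clamped != action.requested_w:
        return Decision.MODIFY, clamped, "envelope"
    return Decision.AUTHORIZE, clamped, "ok"


policy = Policy(frozenset({"scheduler"}), frozenset({"gpu0", "gpu1"}),
                min_w=150.0, max_w=300.0, min_interval_s=10.0, max_reversals=1)
request = Action("scheduler", ("gpu0",), requested_w=350.0, timestamp_s=100.0)
modified = evaluate(request, policy, history_w=[300.0, 250.0], last_action_s=50.0)
```

In this example a 350 W request against a 300 W ceiling comes back as a modified action enforced at 300 W.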

PATH 05

Tamper-Evident Evidence Chain

Each action is committed as a Power Event Record, SHA-256 hash-chained to its predecessor on an append-only chain (ARIV). Any insertion, deletion, or modification breaks the chain — and every PER is independently replayable.

PER · SHA-256 chain
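
The chaining idea can be illustrated with a few lines of Python. This is a generic hash-chain sketch, not the ARIV format: the canonical JSON serialization, the all-zero genesis hash, and the `append` and `verify` helpers are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def entry_hash(entry: dict, prev_hash: str) -> str:
    """SHA-256 over the canonically serialized entry followed by the previous hash."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload + prev_hash.encode("ascii")).hexdigest()


def append(chain: list, entry: dict) -> dict:
    """Append an entry, linking it to the hash of its predecessor."""
    prev = chain[-1]["entry_hash"] if chain else GENESIS
    record = {"entry": entry, "prev_hash": prev, "entry_hash": entry_hash(entry, prev)}
    chain.append(record)
    return record


def verify(chain: list) -> int:
    """Return the index of the first broken record, or -1 if the chain is intact."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        if rec["prev_hash"] != prev or rec["entry_hash"] != entry_hash(rec["entry"], prev):
            return i
        prev = rec["entry_hash"]
    return -1


chain = []
for seq, watts in enumerate([350, 300, 250]):
    append(chain, {"seq": seq, "action": "SET_POWER_LIMIT", "enforced_w": watts})

# Copy the chain, then alter one committed value to simulate tampering.
tampered = [dict(rec, entry=dict(rec["entry"])) for rec in chain]
tampered[1]["entry"]["enforced_w"] = 999
```

Calling `verify(chain)` walks the links from the start, and `verify(tampered)` reports the index where the recomputed hash stops matching. A chain like this one only detects removal of the most recent records if the latest hash or sequence number is also anchored somewhere outside the chain.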

Failure Isolation

Designed for any path to fail

The Spark-XC architecture assumes failure. The five validation paths are independent, and every degradation is itself committed to the evidence chain as a Power Event Record, so a missing signal is recorded as evidence rather than silently dropped.

Independent Paths

Telemetry, workload, facility, policy, and evidence are validated independently. A gap in one path — say, no DCIM signal available — narrows what can be proven, but does not silently pass the action through.

Fail-Closed Gates

When required context is missing or a gate cannot confirm authority, rate, scope, or oscillation limits, the policy path can defer or block the action rather than authorize blind.

Failure Is Evidence

A failure in any path is itself a recorded event. The evidence chain captures fault conditions — including the governance layer's own degradation — so the audit trail stays complete even when a path degrades.

Governs, Doesn't Replace

Spark-XC sits above Mission Control, DCGM, schedulers, DCIM, and BMS — so even if the governance layer is offline, the vendor stack continues to execute. Out-of-band changes (e.g. a direct nvidia-smi -pl) are detected and committed to the chain.
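
Out-of-band detection can be thought of as comparing the limit read back from the GPU with the last limit the chain committed. The sketch below assumes both values are already available; the `detect_out_of_band` helper, the event field names, and the 1 W tolerance are hypothetical.

```python
from typing import Optional


def detect_out_of_band(last_committed_w: float, observed_limit_w: float,
                       tolerance_w: float = 1.0) -> Optional[dict]:
    """Return an event describing the drift, or None when the values agree."""
    delta_w = observed_limit_w - last_committed_w
    if abs(delta_w) <= tolerance_w:
        return None
    return {
        "action": "OUT_OF_BAND_CHANGE",   # hypothetical event name
        "expected_w": last_committed_w,
        "readback_w": observed_limit_w,
        "delta_w": delta_w,
    }


# Last governed action set 300 W; someone then changed the limit directly to 400 W.
event = detect_out_of_band(last_committed_w=300.0, observed_limit_w=400.0)
no_event = detect_out_of_band(last_committed_w=300.0, observed_limit_w=300.5)
```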

Graceful Degradation

Watchdogs, circuit breakers, and an explicit DEGRADED mode govern the validation layer's own health. If validation stops entirely, the underlying vendor stack remains in control and the last committed state is preserved.
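
As a rough sketch of how a watchdog and a circuit breaker could feed an explicit DEGRADED mode, consider the health tracker below. The class name, the 5-second heartbeat timeout, the three-failure breaker threshold, and the HEALTHY and OFFLINE mode names are assumptions made for illustration; only DEGRADED comes from the description above.

```python
from enum import Enum


class Mode(Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"
    OFFLINE = "OFFLINE"


class ValidationHealth:
    """Tracks per-path heartbeats (watchdog) and consecutive failures (breaker)."""

    def __init__(self, paths, heartbeat_timeout_s=5.0, failure_threshold=3):
        self.timeout_s = heartbeat_timeout_s
        self.threshold = failure_threshold
        self.last_seen_s = {p: None for p in paths}
        self.failures = {p: 0 for p in paths}

    def heartbeat(self, path, now_s):
        """A successful check refreshes the watchdog and closes the breaker."""
        self.last_seen_s[path] = now_s
        self.failures[path] = 0

    def record_failure(self, path):
        self.failures[path] += 1

    def unavailable(self, now_s):
        """Paths whose watchdog expired or whose breaker has tripped."""
        down = []
        for path, seen in self.last_seen_s.items():
            stale = seen is None or now_s - seen > self.timeout_s
            tripped = self.failures[path] >= self.threshold
            if stale or tripped:
                down.append(path)
        return down

    def mode(self, now_s):
        down = self.unavailable(now_s)
        if not down:
            return Mode.HEALTHY
        if len(down) == len(self.last_seen_s):
            return Mode.OFFLINE
        return Mode.DEGRADED


health = ValidationHealth(["telemetry", "workload", "facility", "policy", "evidence"])
for name in list(health.last_seen_s):
    health.heartbeat(name, now_s=0.0)
mode_at_start = health.mode(now_s=1.0)

for _ in range(3):                      # three consecutive facility failures
    health.record_failure("facility")
mode_after_failures = health.mode(now_s=1.0)
mode_after_silence = health.mode(now_s=100.0)   # every heartbeat is now stale
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to replay.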

Independently Replayable

The validation properties are externally measurable. GPU telemetry snapshots, facility correlation, and the SHA-256 evidence chain are each verifiable, and every Power Event Record can be replayed by an operator, auditor, or CFO.

Technical Detail

The Power Event Record, concretely

Every governed power action emits one Power Event Record: an evidence bundle answering whether the action was approved, safe, auditable, and financially real. Each PER includes a timestamp, action parameters, a telemetry snapshot, and a SHA-256 hash computed over the entry concatenated with the previous entry's hash. That hash anchors the record to an append-only, tamper-evident chain (ARIV) that is independently replayable. HMAC signing is available when a key is configured.

// SPARK-XC Power Event Record — ARIV chain entry (schematic)
{
  "seq":         14820,
  "timestamp_us": "2025-09-14T09:14:02.118Z",
  "path":         "POLICY_APPROVAL_GATE",
  "action":       "SET_POWER_LIMIT",
  "requested_w":  350,
  "enforced_w":   300,
  "readback_w":   300,
  "delta_w":      0,
  "prev_hash":    "2e57a3...419f",
  "entry_hash":   "f33c91...8b02" // SHA-256(entry || prev_hash)
}
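
To make the hash and the optional HMAC concrete, here is a small sketch that seals a record shaped like the schematic above. The serialization (sorted-key JSON), the `seal_per` and `check_per` helpers, the demo key, and the all-zero `prev_hash` placeholder are assumptions for illustration; the actual ARIV encoding may differ.

```python
import hashlib
import hmac
import json
from typing import Optional

SEAL_FIELDS = ("prev_hash", "entry_hash", "entry_hmac")


def _message(fields: dict, prev_hash: str) -> bytes:
    """Canonical entry bytes followed by the previous hash: entry || prev_hash."""
    body = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return body + prev_hash.encode("ascii")


def seal_per(fields: dict, prev_hash: str, key: Optional[bytes] = None) -> dict:
    """Attach prev_hash and entry_hash, plus entry_hmac when a key is configured."""
    message = _message(fields, prev_hash)
    record = dict(fields)
    record["prev_hash"] = prev_hash
    record["entry_hash"] = hashlib.sha256(message).hexdigest()
    if key is not None:
        record["entry_hmac"] = hmac.new(key, message, hashlib.sha256).hexdigest()
    return record


def check_per(record: dict, key: Optional[bytes] = None) -> bool:
    """Recompute the seal from the record's own fields and compare in constant time."""
    fields = {k: v for k, v in record.items() if k not in SEAL_FIELDS}
    expected = seal_per(fields, record["prev_hash"], key)
    if not hmac.compare_digest(expected["entry_hash"], record["entry_hash"]):
        return False
    if key is not None:
        return hmac.compare_digest(expected["entry_hmac"], record.get("entry_hmac", ""))
    return True


fields = {
    "seq": 14820,
    "timestamp_us": "2025-09-14T09:14:02.118Z",
    "path": "POLICY_APPROVAL_GATE",
    "action": "SET_POWER_LIMIT",
    "requested_w": 350,
    "enforced_w": 300,
    "readback_w": 300,
    "delta_w": 0,
}
placeholder_prev = "0" * 64             # stand-in; the real predecessor hash is elided above
per = seal_per(fields, placeholder_prev)
signed = seal_per(fields, placeholder_prev, key=b"demo-key")
altered = dict(per, enforced_w=350)     # change a field without resealing
```

The plain hash shows that a record has not changed since it was sealed; the keyed HMAC additionally ties the seal to whoever holds the key.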

Operational Scenario

Governing rack-scale power volatility

Rack-scale systems like GB200 NVL72 draw on the order of 120 kW per rack, and synchronized training swings them between near-idle and full draw in seconds. Those fast, correlated ramps stress facility power and the grid — which is why operators and utilities increasingly require ramp-rate-limited power actions (power smoothing). The hard part isn't smoothing the ramp; it's proving it happened within limits.

Spark-XC governs each ramp-rate-limited action across all five validation paths — and commits a Power Event Record for every one. Mission Control executes the smoothing; Spark-XC proves it stayed within the ramp-rate, floor, and ceiling your policy and your utility agreement require.

Ramp-rate policy gates

Path 4 enforces ramp-rate, floor, and ceiling limits before an action reaches hardware — so a synchronized job can't slam the rack from idle to full outside the agreed envelope.
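
A ramp-rate gate of this kind can be sketched as a clamp on each step. The helper names and the numeric policy values below (a 10 kW/s ramp limit, a 15 kW floor, a 125 kW ceiling) are hypothetical; only the roughly 120 kW rack figure comes from the scenario above.

```python
def next_setpoint(current_w, target_w, dt_s, max_ramp_w_per_s, floor_w, ceiling_w):
    """Move toward target_w by at most max_ramp_w_per_s * dt_s, inside the envelope."""
    max_step_w = max_ramp_w_per_s * dt_s
    gap_w = target_w - current_w
    if abs(gap_w) <= max_step_w:
        proposed_w = target_w            # close enough to land on the target exactly
    else:
        proposed_w = current_w + (max_step_w if gap_w > 0 else -max_step_w)
    return max(floor_w, min(ceiling_w, proposed_w))


def ramp_schedule(start_w, target_w, dt_s, max_ramp_w_per_s, floor_w, ceiling_w):
    """List the setpoints needed to reach the (envelope-clamped) target."""
    if max_ramp_w_per_s <= 0 or dt_s <= 0 or floor_w > ceiling_w:
        raise ValueError("ramp rate and dt must be positive, and floor <= ceiling")
    target_w = max(floor_w, min(ceiling_w, target_w))
    setpoints = []
    current_w = start_w
    while current_w != target_w:
        current_w = next_setpoint(current_w, target_w, dt_s,
                                  max_ramp_w_per_s, floor_w, ceiling_w)
        setpoints.append(current_w)
    return setpoints


# Rack ramps from 20 kW to 120 kW under an assumed 10 kW/s limit, stepping every second.
schedule = ramp_schedule(20_000.0, 120_000.0, dt_s=1.0, max_ramp_w_per_s=10_000.0,
                         floor_w=15_000.0, ceiling_w=125_000.0)
# A request above the ceiling is held at the ceiling.
capped = ramp_schedule(20_000.0, 200_000.0, dt_s=1.0, max_ramp_w_per_s=10_000.0,
                       floor_w=15_000.0, ceiling_w=125_000.0)
```

Under these assumed limits, the idle-to-full swing is spread over ten one-second steps instead of arriving as a single jump.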

Telemetry & facility reconciliation

Paths 1 and 3 confirm the actual ramp at the GPU and reconcile it against rack, PDU/UPS, and utility data — proving the facility-side effect, not just the requested setpoint.
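
Confirming the actual ramp amounts to checking measured samples against the agreed envelope. The sketch below assumes time-ordered `(timestamp_s, power_w)` samples from telemetry or a PDU; the `check_ramp` helper, the violation labels, and the limits used are hypothetical.

```python
def check_ramp(samples, max_ramp_w_per_s, floor_w, ceiling_w):
    """Check measured (timestamp_s, power_w) samples against ramp and envelope limits."""
    violations = []
    for t_s, power_w in samples:
        if power_w < floor_w or power_w > ceiling_w:
            violations.append({"kind": "ENVELOPE", "t_s": t_s, "power_w": power_w})
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("samples must be strictly time-ordered")
        rate = (w1 - w0) / (t1 - t0)     # observed ramp between adjacent samples
        if abs(rate) > max_ramp_w_per_s:
            violations.append({"kind": "RAMP_RATE", "t_s": t1, "rate_w_per_s": rate})
    return {"compliant": not violations, "violations": violations}


limits = dict(max_ramp_w_per_s=10_000.0, floor_w=15_000.0, ceiling_w=125_000.0)
smooth = check_ramp([(0.0, 20_000.0), (1.0, 30_000.0), (2.0, 40_000.0)], **limits)
spike = check_ramp([(0.0, 20_000.0), (1.0, 120_000.0)], **limits)
```

Running the same check on GPU-side and facility-side samples gives two independent views of one ramp, which is the reconciliation the card describes.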

Proof for the utility

Path 5 commits each smoothed action to the tamper-evident chain, so demand-response and ramp-rate compliance are independently replayable — settlement-grade evidence, not assertions.

Go Deeper

See a power action validated end to end

Walk a single power action through all five validation paths, see the data that flows between them, and replay the Power Event Record it commits.
