Five validation paths, one Power Event Record

Spark-XC sits above existing GPU, workload, facility, grid, and finance systems to validate, authorize, and prove AI power actions. Every governed action is checked across five independent validation paths and committed as a Power Event Record. Mission Control executes. Spark-XC validates.

Where SPARK-XC sits in your AI infrastructure

AI infrastructure already runs vendor stacks that execute power actions — NVIDIA Mission Control, DCGM, schedulers, DCIM, BMS, and facility controls. Spark-XC does not compete with them.

It sits above those systems as a governance layer: validating, authorizing, and proving each power action across GPU, workload, facility, grid, and finance before and after it reaches hardware — then committing a Power Event Record.

Key architectural property
Governance is layered above the vendor stack: every action is validated across five independent paths and proven with a Power Event Record — without replacing Mission Control, DCGM, or your schedulers.
SPARK-XC Governance Layer
1 · GPU Telemetry Validation [TELEMETRY]
2 · Workload / Scheduler Context [WORKLOAD]
3 · Facility Power Correlation [FACILITY]
4 · Policy / Approval Gates [POLICY]
5 · Evidence Chain (PER) [PROVE]
GPU & Workload Control [EXECUTE]
Mission Control · DCGM · Slurm · K8s · Run:ai [VENDOR]
Facility & Grid: DCIM · BMS · PDU/UPS · Utility [FACILITY]
GPU Hardware [PCIe x16]

Spark-XC sits above — and stamps every action

Governance is layered above the execution path. As each action passes the Govern layer, Spark-XC stamps it with a Power Event Record. Because it validates and proves rather than gating execution inline, the stack keeps running even if governance is offline.

Spark-XC sits above execution: it stamps each action with a Power Event Record without sitting in the control path.

Governance offline: execution still flows Execute → Hardware, but no PER is produced for those actions. A governance outage does not stall the vendor stack, so resilience is preserved, and the gap is itself recorded once governance returns.

Each validation path, explained

PATH 01
GPU Telemetry Validation
Captures NVML/DCGM pre- and post-action snapshots — power, clocks, utilization, and temperature — to confirm what the GPU actually did against what was requested.
NVML / DCGM
PATH 02
Workload / Scheduler Context
Pulls job and throughput context from Slurm, Kubernetes, and Run:ai, so every power action is validated against the workload it affects — not just the raw register.
Slurm · K8s · Run:ai
PATH 03
Facility Power Correlation
Correlates GPU-side power with DCIM, BMS, PDU/UPS, and utility signals where available — tying a GPU-level action to its rack, room, and grid impact.
DCIM · BMS · Grid
PATH 04
Policy / Approval Gates
Authority, rate, scope, and oscillation gates evaluate every action and authorize, modify, defer, or block it before it reaches hardware. Gates are enterprise-configurable.
Authorize · Block
PATH 05
Tamper-Evident Evidence Chain
Each action is committed as a Power Event Record, SHA-256 hash-chained to its predecessor on an append-only chain (ARIV). Any insertion, deletion, or modification breaks the chain — and every PER is independently replayable.
PER · SHA-256 chain
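
As an illustration of Path 01, the pre/post comparison can be sketched in a few lines of Python. This is a minimal sketch, not Spark-XC's implementation: the `Snapshot` fields, the `validate_readback` helper, the 1 W tolerance, and the sample values are all assumptions made for the example. In practice the snapshots would come from NVML or DCGM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical subset of one NVML/DCGM sample (field names are assumptions)."""
    power_limit_w: float
    power_draw_w: float
    sm_clock_mhz: int
    utilization_pct: float
    temperature_c: float


def validate_readback(pre: Snapshot, post: Snapshot, enforced_w: float,
                      tolerance_w: float = 1.0) -> dict:
    """Compare the limit read back after the action with the limit that was enforced."""
    delta_w = post.power_limit_w - enforced_w
    return {
        "pre_limit_w": pre.power_limit_w,
        "readback_w": post.power_limit_w,
        "delta_w": delta_w,
        # Confirmed only if the GPU reports the enforced limit within tolerance.
        "confirmed": abs(delta_w) <= tolerance_w,
    }


# Illustrative values only: limit lowered from 350 W to 300 W.
pre = Snapshot(350.0, 341.2, 1755, 97.0, 71.0)
post = Snapshot(300.0, 296.8, 1590, 96.0, 68.0)
result = validate_readback(pre, post, enforced_w=300.0)
```

Here `result["delta_w"]` plays the role of a readback-versus-enforced difference: zero means the hardware reports exactly what was enforced.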

Designed for any path to fail

The Spark-XC architecture assumes failure. The five validation paths are independent, and every degradation is itself committed to the evidence chain as a Power Event Record, so a missing signal is recorded as evidence rather than silently dropped.

Independent Paths
Telemetry, workload, facility, policy, and evidence are validated independently. A gap in one path — say, no DCIM signal available — narrows what can be proven, but does not silently pass the action through.
Fail-Closed Gates
When required context is missing or a gate cannot confirm authority, rate, scope, or oscillation limits, the policy path can defer or block the action rather than authorize blind.
Failure Is Evidence
A failure in any path is itself a recorded event. The evidence chain captures fault conditions — including the governance layer's own degradation — so the audit trail stays complete even when a path degrades.
Governs, Doesn't Replace
Spark-XC sits above Mission Control, DCGM, schedulers, DCIM, and BMS — so even if the governance layer is offline, the vendor stack continues to execute. Out-of-band changes (e.g. a direct nvidia-smi -pl) are detected and committed to the chain.
Graceful Degradation
Watchdogs, circuit breakers, and an explicit DEGRADED mode govern the validation layer's own health. If validation stops entirely, the underlying vendor stack remains in control and the last committed state is preserved.
Independently Replayable
The validation properties are externally measurable. GPU telemetry snapshots, facility correlation, and the SHA-256 evidence chain are each verifiable, and every Power Event Record can be replayed by an operator, auditor, or CFO.
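
The fail-closed behavior described above can be sketched as a small decision function. This is a hedged illustration under stated assumptions: the `Decision` enum, the `evaluate_gates` signature, and the default rate limit are hypothetical, and the sketch covers only authority, scope, and rate (the oscillation gate and the "modify" outcome are omitted for brevity). The key property is that missing context yields a deferral with a recorded reason, never a blind authorization.

```python
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    AUTHORIZE = "authorize"
    DEFER = "defer"
    BLOCK = "block"


def evaluate_gates(actor_authorized: Optional[bool],
                   in_scope: Optional[bool],
                   actions_last_minute: Optional[int],
                   max_actions_per_minute: int = 6) -> Tuple[Decision, str]:
    """Return a decision plus a reason string that could be committed as evidence."""
    context = {
        "authority": actor_authorized,
        "scope": in_scope,
        "rate": actions_last_minute,
    }
    # Fail closed: None means the signal was unavailable, so defer instead of guessing.
    missing = [name for name, value in context.items() if value is None]
    if missing:
        return Decision.DEFER, "missing context: " + ", ".join(missing)
    if not actor_authorized:
        return Decision.BLOCK, "actor lacks authority"
    if not in_scope:
        return Decision.BLOCK, "target outside permitted scope"
    if actions_last_minute >= max_actions_per_minute:
        return Decision.DEFER, "rate limit reached"
    return Decision.AUTHORIZE, "all gates passed"


decision, reason = evaluate_gates(True, None, 0)
```

In this call the scope signal is unavailable, so the action is deferred and the reason names the missing input.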

The Power Event Record, concretely

Every governed power action emits one Power Event Record: an evidence bundle answering whether the action was approved, safe, auditable, and financially real. Each PER includes a timestamp, action parameters, a telemetry snapshot, and a SHA-256 hash computed over the entry concatenated with the previous entry's hash. That hash anchors the record to an append-only, tamper-evident chain (ARIV) that is independently replayable. HMAC signing is available when a key is configured.

// SPARK-XC Power Event Record: ARIV chain entry (schematic)
{
  "seq": 14820,
  "timestamp_us": "2025-09-14T09:14:02.118Z",
  "path": "POLICY_APPROVAL_GATE",
  "action": "SET_POWER_LIMIT",
  "requested_w": 350,
  "enforced_w": 300,
  "readback_w": 300,
  "delta_w": 0,
  "prev_hash": "2e57a3...419f",
  "entry_hash": "f33c91...8b02"  // SHA-256(entry || prev_hash)
}
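
To make the chaining rule concrete, here is a minimal Python sketch of SHA-256(entry || prev_hash) and of how a replay detects tampering. It is an illustration, not the ARIV format: the canonical JSON serialization, the all-zero genesis hash, and the `append` and `verify` helpers are assumptions introduced for the example, and HMAC signing is left out.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first entry's predecessor


def entry_hash(entry: dict, prev_hash: str) -> str:
    """SHA-256 over the serialized entry concatenated with the previous hash."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload + prev_hash.encode("ascii")).hexdigest()


def append(chain: list, entry: dict) -> dict:
    """Link a new entry to the current chain tip and append it."""
    prev_hash = chain[-1]["entry_hash"] if chain else GENESIS_HASH
    record = {**entry, "prev_hash": prev_hash,
              "entry_hash": entry_hash(entry, prev_hash)}
    chain.append(record)
    return record


def verify(chain: list) -> bool:
    """Replay the chain; any insertion, deletion, or modification fails a check."""
    expected_prev = GENESIS_HASH
    for record in chain:
        body = {k: v for k, v in record.items()
                if k not in ("prev_hash", "entry_hash")}
        if record["prev_hash"] != expected_prev:
            return False
        if entry_hash(body, record["prev_hash"]) != record["entry_hash"]:
            return False
        expected_prev = record["entry_hash"]
    return True


chain = []
append(chain, {"seq": 1, "action": "SET_POWER_LIMIT", "enforced_w": 300})
append(chain, {"seq": 2, "action": "SET_POWER_LIMIT", "enforced_w": 350})
```

Because each hash covers the previous one, editing an early record invalidates its own hash, and dropping a record breaks the `prev_hash` link of its successor.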

Governing rack-scale power volatility

Rack-scale systems like GB200 NVL72 draw on the order of 120 kW per rack, and synchronized training swings them between near-idle and full draw in seconds. Those fast, correlated ramps stress facility power and the grid, which is why operators and utilities increasingly require ramp-rate-limited power actions (power smoothing). The hard part isn't smoothing the ramp; it's proving that each ramp stayed within limits.

Spark-XC governs each ramp-rate-limited action across all five validation paths — and commits a Power Event Record for every one. Mission Control executes the smoothing; Spark-XC proves it stayed within the ramp-rate, floor, and ceiling your policy and your utility agreement require.

Ramp-rate policy gates
Path 4 enforces ramp-rate, floor, and ceiling limits before an action reaches hardware — so a synchronized job can't slam the rack from idle to full outside the agreed envelope.
Telemetry & facility reconciliation
Paths 1 and 3 confirm the actual ramp at the GPU and reconcile it against rack, PDU/UPS, and utility data — proving the facility-side effect, not just the requested setpoint.
Proof for the utility
Path 5 commits each smoothed action to the tamper-evident chain, so demand-response and ramp-rate compliance are independently replayable — settlement-grade evidence, not assertions.
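
The ramp-rate, floor, and ceiling envelope can be illustrated with a short clamp function. This is a sketch under assumptions: the `clamp_ramp` name, its parameters, and the numbers in the usage lines are hypothetical and do not reflect any particular utility agreement or Spark-XC's actual gate logic.

```python
def clamp_ramp(current_w: float, requested_w: float, max_ramp_w_per_s: float,
               dt_s: float, floor_w: float, ceiling_w: float) -> float:
    """Return the setpoint allowed for the next dt_s seconds."""
    # First bound the request to the agreed floor/ceiling envelope.
    target_w = min(max(requested_w, floor_w), ceiling_w)
    # Then limit how far the setpoint may move within this interval.
    max_step_w = max_ramp_w_per_s * dt_s
    step_w = min(max(target_w - current_w, -max_step_w), max_step_w)
    return current_w + step_w


# Illustrative rack-level numbers: 10 kW/s ramp limit, 15 kW floor, 110 kW ceiling.
up = clamp_ramp(20_000, 120_000, 10_000, 1.0, 15_000, 110_000)
down = clamp_ramp(100_000, 0, 10_000, 2.0, 15_000, 110_000)
```

A request to jump from 20 kW straight to 120 kW is first capped at the ceiling and then limited to one ramp step, which is the kind of modification a policy gate could record alongside the original request.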

See a power action validated end to end

Walk a single power action through all five validation paths, see the data that flows between them, and replay the Power Event Record it commits.
