Platform Overview

The governance layer your AI infrastructure power stack has been missing.

Spark-XC sits above existing GPU, workload, facility, grid, and finance systems to validate, authorize, and prove AI power actions — across five validation paths, with a tamper-evident Power Event Record for every action. Mission Control executes. Spark-XC validates.

Core Capabilities

Everything your AI infrastructure needs to prove every power action

GPU Telemetry Validation

Pre- and post-action NVML/DCGM snapshots — power, clocks, and utilization — confirm what the hardware actually did, not just what was requested.
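A minimal sketch of the pre/post comparison described above, assuming the snapshots have already been collected from NVML or DCGM. The `GpuSnapshot` fields, the `confirm_power_cap` helper, the tolerance, and the sample values are all illustrative assumptions, not Spark-XC's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSnapshot:
    """One telemetry sample (hypothetical field names)."""
    power_w: float      # board power draw in watts
    sm_clock_mhz: int   # SM clock in MHz
    util_pct: int       # GPU utilization, 0-100


def confirm_power_cap(pre: GpuSnapshot, post: GpuSnapshot,
                      requested_cap_w: float, tolerance_w: float = 10.0) -> dict:
    """Compare pre/post snapshots and check the post-action draw against the cap."""
    return {
        "power_delta_w": round(post.power_w - pre.power_w, 1),
        "clock_delta_mhz": post.sm_clock_mhz - pre.sm_clock_mhz,
        "util_delta_pct": post.util_pct - pre.util_pct,
        # The hardware "did what was asked" if it settled at or under the cap.
        "cap_honored": post.power_w <= requested_cap_w + tolerance_w,
    }


# Made-up sample values for illustration only.
pre = GpuSnapshot(power_w=940.0, sm_clock_mhz=1965, util_pct=96)
post = GpuSnapshot(power_w=581.0, sm_clock_mhz=1410, util_pct=96)
result = confirm_power_cap(pre, post, requested_cap_w=580.0)
print(result)
```

The point of comparing both snapshots, rather than trusting the request, is that a cap can be accepted by the driver and still not show up in measured draw.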

Workload & Scheduler Context

Correlates each power action with Slurm, Kubernetes, and Run:ai job context so every change is tied to the workload it served.
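One way to picture this correlation is a lookup from (GPU, timestamp) to the job that held the GPU at that moment. The sketch below assumes job allocations have already been pulled from the scheduler; `JobAllocation` and `job_for_action` are hypothetical names, and real scheduler integrations would be more involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobAllocation:
    """A scheduler allocation (hypothetical shape)."""
    job_id: str
    scheduler: str            # e.g. "slurm", "kubernetes", "runai"
    gpu_ids: frozenset
    start_ts: float           # epoch seconds
    end_ts: Optional[float]   # None while the job is still running


def job_for_action(allocations, gpu_id: str, action_ts: float) -> Optional[JobAllocation]:
    """Return the allocation that held gpu_id at action_ts, or None."""
    for alloc in allocations:
        still_running = alloc.end_ts is None or action_ts < alloc.end_ts
        if gpu_id in alloc.gpu_ids and alloc.start_ts <= action_ts and still_running:
            return alloc
    return None


allocs = [
    JobAllocation("job-1", "slurm", frozenset({"GPU-00"}), 100.0, 200.0),
    JobAllocation("job-2", "kubernetes", frozenset({"GPU-00", "GPU-01"}), 200.0, None),
]
match = job_for_action(allocs, "GPU-00", 150.0)
print(match.job_id if match else "no job context")
```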

Facility Power Correlation

Cross-checks GPU-level actions against DCIM, BMS, PDU/UPS, and utility APIs so rack, facility, and grid effects all reconcile.
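As a rough illustration of what "reconcile" could mean here, the sketch below compares the summed GPU-level power change with the change a rack PDU reports, within a tolerance. The function name, the tolerance values, and the sample numbers are assumptions for illustration; the source does not specify how Spark-XC performs this check.

```python
def reconcile_rack(gpu_deltas_w, pdu_delta_w, rel_tol=0.15, abs_tol_w=50.0):
    """Check that the PDU-level change matches the summed GPU-level change."""
    expected_w = sum(gpu_deltas_w)
    # Allow whichever is larger: a fixed floor or a fraction of the expected change.
    allowed_w = max(abs_tol_w, rel_tol * abs(expected_w))
    mismatch_w = pdu_delta_w - expected_w
    return {
        "expected_w": expected_w,
        "observed_w": pdu_delta_w,
        "mismatch_w": mismatch_w,
        "reconciled": abs(mismatch_w) <= allowed_w,
    }


# Four GPUs each dropped 359 W (made-up values); the PDU saw a 1,390 W drop.
report = reconcile_rack([-359.0, -359.0, -359.0, -359.0], pdu_delta_w=-1390.0)
print(report)
```

A tolerance is needed because fans, CPUs, and conversion losses also move when GPU power changes, so rack-level and GPU-level numbers rarely match exactly.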

Policy & Approval Gates

Authority, rate, and scope gates confirm a power action is permitted, scoped, and rate-limited before it ever reaches hardware.

Tamper-Evident Evidence Chain

Every governed action is SHA-256 hash-chained into an append-only audit chain (ARIV) — a forensically complete, tamper-evident Power Event Record for every decision.

Sits Above, Doesn't Compete

Spark-XC governs on top of NVIDIA Mission Control, DCGM, schedulers, DCIM, and BMS. Mission Control executes; Spark-XC validates, authorizes, and proves.

How It Works

From power action to proof

Every AI power action runs through five validation paths and emits a single Power Event Record — the atomic unit of proof that it was approved, safe, auditable, and financially real.

Power Action Requested

A workload, scheduler, operator, or vendor stack (NVIDIA Mission Control, DCGM) initiates a power action that Spark-XC governs.

GPU Telemetry Validation

Path 1 captures NVML/DCGM pre- and post-snapshots — power, clocks, utilization — to confirm what the hardware actually did.

Workload & Facility Correlation

Paths 2 and 3 tie the action to Slurm, Kubernetes, and Run:ai job context, then reconcile it against DCIM, BMS, PDU/UPS, and utility APIs.

Policy & Approval Gates

Path 4 evaluates authority, rate, and scope gates. The action is authorized, modified, or rejected before it reaches hardware.
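The three gates and three outcomes can be sketched as a single evaluation function. Everything below is a hypothetical illustration: the `PowerAction` and `Policy` shapes, the 60-second rate window, and the rule that out-of-scope requests are trimmed rather than rejected are assumptions, not documented Spark-XC behavior.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PowerAction:
    actor: str
    gpu_ids: tuple
    cap_w: float
    ts: float           # epoch seconds


@dataclass(frozen=True)
class Policy:
    allowed_actors: frozenset
    allowed_gpus: frozenset
    min_cap_w: float
    max_actions_per_minute: int


def evaluate(action, policy, recent_ts):
    """Return (decision, action); decision is authorized, modified, or rejected."""
    # Authority gate: is this actor allowed to request power actions at all?
    if action.actor not in policy.allowed_actors:
        return "rejected", action
    # Rate gate: count prior actions in the trailing 60-second window.
    recent = [t for t in recent_ts if action.ts - 60.0 < t <= action.ts]
    if len(recent) >= policy.max_actions_per_minute:
        return "rejected", action
    # Scope gate: trim GPUs outside policy and clamp the cap to the floor.
    in_scope = tuple(g for g in action.gpu_ids if g in policy.allowed_gpus)
    if not in_scope:
        return "rejected", action
    cap_w = max(action.cap_w, policy.min_cap_w)
    if in_scope != action.gpu_ids or cap_w != action.cap_w:
        return "modified", replace(action, gpu_ids=in_scope, cap_w=cap_w)
    return "authorized", action


policy = Policy(frozenset({"scheduler"}), frozenset({"GPU-00", "GPU-01"}), 400.0, 2)
decision, final = evaluate(PowerAction("scheduler", ("GPU-00", "GPU-07"), 300.0, 1000.0), policy, [])
print(decision, final)
```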

Evidence Chain Commit

Path 5 hash-chains the action into the append-only, SHA-256 tamper-evident chain (ARIV). HMAC signing is optional and available in every deployment.
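The general technique behind a SHA-256 hash chain with optional HMAC signing can be sketched in a few lines. This is a generic illustration using Python's standard library, not the ARIV format: the entry layout, the `append` and `verify` helpers, and the all-zero genesis value are assumptions.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def _digest(prev_hash, payload):
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def _sign(key, entry_hash):
    return hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()


def append(chain, payload, hmac_key=None):
    """Append an entry whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"prev_hash": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    if hmac_key is not None:  # signing is optional
        entry["hmac"] = _sign(hmac_key, entry["hash"])
    chain.append(entry)
    return entry


def verify(chain, hmac_key=None):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        if hmac_key is not None:
            if not hmac.compare_digest(entry.get("hmac", ""), _sign(hmac_key, entry["hash"])):
                return False
        prev_hash = entry["hash"]
    return True


key = b"demo-key"  # illustrative only; never hard-code real keys
chain = []
append(chain, {"action": "set_power_cap", "gpu": "GPU-00", "cap_w": 580}, key)
append(chain, {"action": "set_power_cap", "gpu": "GPU-01", "cap_w": 580}, key)
print(verify(chain, key))
```

The hash chain alone makes edits detectable by anyone who recomputes it; the HMAC adds proof that entries were written by a holder of the key.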

Power Event Record Emitted

A self-contained, independently replayable Power Event Record is committed — proving the action was approved, safe, auditable, and financially real.

Validation Flow

Power Action

REQUEST

↓

Mission Control / DCGM / Scheduler

EXECUTE

↓

SPARK-XC VALIDATION PATHS

1 · GPU Telemetry Validation
NVML/DCGM

2 · Workload / Scheduler Context
CONTEXT

3 · Facility Power Correlation
CORRELATE

4 · Policy / Approval Gates
GATE

5 · Tamper-Evident Evidence Chain
PER

↓

Power Event Record

PROVEN

Governance Posture

Spark-XC sits above — and stamps every action

Governance is layered above the execution path. As each action passes the Govern layer, Spark-XC stamps it with a Power Event Record. Because it validates and proves rather than gating execution inline, the stack keeps running even if governance is offline.

Governance offline?

Govern (Spark-XC): validate · authorize · prove

PER OFFLINE

Execute (Mission Control): DCGM · scheduler

Facility: DCIM · BMS · PDU

Hardware: GPU

ACTION · PER

Spark-XC sits above execution: it stamps each action with a Power Event Record without sitting in the control path.

Governance offline: execution still flows Execute → Hardware — but no PER is produced for those actions. Spark-XC proves; it does not block, so resilience is preserved and the gap is itself recorded once governance returns.
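One plausible way to record such a gap, sketched under assumptions: once governance returns, compare the execution layer's own action log against the set of actions that have a Power Event Record, and summarize what is missing. The `find_governance_gap` helper and the record shape are hypothetical; the source does not describe how the gap is detected.

```python
def find_governance_gap(executed, per_action_ids):
    """Summarize executed actions that have no Power Event Record.

    executed: list of (action_id, timestamp) from the execution layer's log.
    per_action_ids: set of action IDs that do have a PER.
    Returns a gap record, or None if every action is covered.
    """
    missing = [(action_id, ts) for action_id, ts in executed
               if action_id not in per_action_ids]
    if not missing:
        return None
    timestamps = [ts for _, ts in missing]
    return {
        "type": "governance_gap",
        "first_ts": min(timestamps),
        "last_ts": max(timestamps),
        "action_ids": [action_id for action_id, _ in missing],
    }


executed = [("a1", 10.0), ("a2", 20.0), ("a3", 30.0)]
gap = find_governance_gap(executed, per_action_ids={"a1"})
print(gap)
```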

Observability

See every power action — read-only by design

The Spark-XC dashboard is a read-only window into your fleet — live power, temperature, and energy telemetry alongside the Power Event Record stream. It observes and proves; it never executes a control action. It reads straight from the ARIV evidence chain and pairs with the metrics you already run on Prometheus and Grafana.

SPARK-XC DASHBOARD · GET /fleet/summary · SAMPLE

Operational · Read-Only

Fleet Power

4,647 W

Avg Temperature

51 °C

Power Savings

38.8 %

Power Event Records

28,491

GPU	Power	Temp	Util	Status
GPU-00 · B200	580 W	51 °C	96 %	● governed
GPU-01 · B200	583 W	52 °C	96 %	● governed
GPU-02 · B200	579 W	50 °C	95 %	● governed
GPU-03 · B200	581 W	51 °C	96 %	● governed

Polls /fleet/summary every 5s · no control actions exposed

8 GPUs governed · 4 shown · ARIV ✓
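A read-only client for this view needs nothing more than a GET and some arithmetic. The sketch below assumes a JSON response with a `gpus` list; that schema, the field names, and both helper functions are hypothetical, since only the `/fleet/summary` path and the 5-second polling interval appear above. The sample rows mirror the four GPUs shown in the table.

```python
import json
import urllib.request


def fetch_summary(base_url, timeout_s=5.0):
    """GET-only read of /fleet/summary (assumed to return JSON)."""
    url = base_url.rstrip("/") + "/fleet/summary"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def headline(summary):
    """Derive headline tiles from per-GPU rows (assumed schema)."""
    gpus = summary["gpus"]
    return {
        "power_w": sum(g["power_w"] for g in gpus),
        "avg_temp_c": round(sum(g["temp_c"] for g in gpus) / len(gpus)),
        "governed": sum(1 for g in gpus if g["status"] == "governed"),
    }


# The four rows shown in the sample table; the full fleet has eight GPUs.
sample = {"gpus": [
    {"id": "GPU-00", "power_w": 580, "temp_c": 51, "util_pct": 96, "status": "governed"},
    {"id": "GPU-01", "power_w": 583, "temp_c": 52, "util_pct": 96, "status": "governed"},
    {"id": "GPU-02", "power_w": 579, "temp_c": 50, "util_pct": 95, "status": "governed"},
    {"id": "GPU-03", "power_w": 581, "temp_c": 51, "util_pct": 96, "status": "governed"},
]}
tiles = headline(sample)
print(tiles)
```

A polling loop would call `fetch_summary` every five seconds and pass the result to `headline`; no request in this sketch writes anything back.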

Read-Only by Design

The dashboard surfaces telemetry and Power Event Records — it never issues a power action. Governance and execution stay separate from observation.

Grafana & Prometheus

Fleet power savings, per-GPU power and thermal, control-loop latency, and safety violations export to the Prometheus and Grafana dashboards you already operate.
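For readers unfamiliar with how such metrics reach Prometheus, the sketch below renders per-GPU power and temperature in the Prometheus text exposition format. The metric names (`sparkxc_gpu_power_watts`, `sparkxc_gpu_temperature_celsius`) and the `gpu` label are invented for illustration; the actual exported names are not given in the source.

```python
def to_prometheus(gpus):
    """Render per-GPU gauges in Prometheus text exposition format."""
    lines = [
        "# HELP sparkxc_gpu_power_watts Per-GPU power draw in watts.",
        "# TYPE sparkxc_gpu_power_watts gauge",
    ]
    for gpu in gpus:
        gpu_id, power_w = gpu["id"], gpu["power_w"]
        lines.append(f'sparkxc_gpu_power_watts{{gpu="{gpu_id}"}} {power_w}')
    lines += [
        "# HELP sparkxc_gpu_temperature_celsius Per-GPU temperature in Celsius.",
        "# TYPE sparkxc_gpu_temperature_celsius gauge",
    ]
    for gpu in gpus:
        gpu_id, temp_c = gpu["id"], gpu["temp_c"]
        lines.append(f'sparkxc_gpu_temperature_celsius{{gpu="{gpu_id}"}} {temp_c}')
    return "\n".join(lines) + "\n"


text = to_prometheus([
    {"id": "GPU-00", "power_w": 580, "temp_c": 51},
    {"id": "GPU-01", "power_w": 583, "temp_c": 52},
])
print(text)
```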

Straight From the Evidence Chain

Every figure traces back to a Power Event Record on the tamper-evident ARIV chain — what you see on screen is the same evidence an auditor can replay.

Why Spark-XC

Ungoverned power actions vs. proven ones

Without a Governance Layer

Vendor stack executes — but no one validates the action
GPU telemetry never reconciled against facility and grid data
Workload and scheduler context lost after the fact
Logs are mutable, scattered, and often incomplete
No authority, rate, or scope gates — any action is honored
No way to prove an action was financially real

Spark-XC

Sits above the vendor stack — every power action validated and authorized
Five validation paths span GPU, workload, facility, policy, and evidence
NVML/DCGM pre/post snapshots reconciled with DCIM, BMS, and utility APIs
SHA-256 hash-chained, tamper-evident evidence chain (optional HMAC signing)
Authority, rate, and scope gates — every action evaluated before hardware
A Power Event Record proves each action was approved, safe, auditable, and financially real

Deployment

Ready to deploy in hours, not months

Spark-XC sits above your existing stack with minimal deployment friction. No kernel modifications. No driver replacements. No application changes. It governs on top of NVIDIA Mission Control, DCGM, schedulers, DCIM, and BMS rather than replacing them.

Prerequisites

✓ NVIDIA GPU (Ampere, Ada, Hopper, Blackwell) or AMD Instinct (MI210, MI250, MI300)

✓ Linux host (Ubuntu 20.04+, RHEL 8+)

✓ NVIDIA driver 525+ or AMD ROCm 5.0+

✓ Root access for hardware register operations

What You Get

→ All 5 validation paths active within minutes

→ Policy and approval gates configurable via JSON/YAML

→ Power Event Records emitted from the first action

→ Zero application code changes required
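To give a feel for what a JSON gate configuration might look like, here is a sketch that loads and sanity-checks one. The key names (`authority`, `rate`, `scope` and their fields) are assumptions modeled on the three gate types named on this page, not the product's documented configuration schema.

```python
import json

# Hypothetical policy document; key names are illustrative.
POLICY_JSON = """
{
  "authority": {"allowed_actors": ["scheduler", "ops-oncall"]},
  "rate": {"max_actions_per_minute": 6},
  "scope": {"gpu_ids": ["GPU-00", "GPU-01"], "min_cap_w": 400}
}
"""

REQUIRED_KEYS = {
    "authority": ["allowed_actors"],
    "rate": ["max_actions_per_minute"],
    "scope": ["gpu_ids", "min_cap_w"],
}


def load_policy(text):
    """Parse a policy document and fail fast on missing or invalid fields."""
    policy = json.loads(text)
    for section, keys in REQUIRED_KEYS.items():
        if section not in policy:
            raise ValueError(f"missing section: {section}")
        for key in keys:
            if key not in policy[section]:
                raise ValueError(f"missing key: {section}.{key}")
    if policy["rate"]["max_actions_per_minute"] <= 0:
        raise ValueError("rate.max_actions_per_minute must be positive")
    return policy


policy_doc = load_policy(POLICY_JSON)
print(policy_doc["scope"]["gpu_ids"])
```

Validating at load time means a malformed policy is caught before any gate evaluates an action against it.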

Get Started

Ready to prove every power action in your AI infrastructure?

We're onboarding partners now. Request a replay and see a real Power Event Record — approved, safe, auditable, and financially real — for your environment.
