Five layers. Each one sufficient on its own.

The SPARK-XC pipeline never assumes the previous layer succeeded. Each enforcement boundary is independently capable of maintaining safety — and each failure is captured, logged, and auditable.


Every layer, in detail

01
Hardware Clamping
The first and most fundamental layer. SPARK-XC writes a maximum power ceiling directly to the GPU's hardware power limit register. This ceiling is enforced at the silicon level — independent of any OS, driver, or application. Even a crashed kernel cannot override it.

The register write occurs at system initialization and is re-applied whenever tampering is detected. The hardware register becomes the absolute ceiling that all other layers operate within.
Below OS enforcement · Register-level write · Re-enforced on tamper detection · Survives driver failure
POWER_LIMIT_REGISTER
> REQUESTED: 350W
> CLAMPED_TO: 300W
> HW_LOCK: TRUE
> BYPASS: NONE
✓ ENFORCED
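
The clamping rule shown in the readout above can be modeled in a few lines. This is only a software sketch of the arithmetic, assuming integer watt values; the actual enforcement happens in the GPU's power limit register, and the names `ClampResult` and `clamp_request` are hypothetical, not part of SPARK-XC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClampResult:
    requested_w: int  # what the caller asked for
    applied_w: int    # what takes effect under the ceiling
    clamped: bool     # True when the request exceeded the ceiling


def clamp_request(requested_w: int, ceiling_w: int) -> ClampResult:
    """Model the ceiling rule: a request can never exceed the ceiling."""
    if requested_w <= 0 or ceiling_w <= 0:
        raise ValueError("power values must be positive")
    applied = min(requested_w, ceiling_w)
    return ClampResult(requested_w, applied, applied < requested_w)


result = clamp_request(350, 300)
print(result.applied_w, result.clamped)  # 300 True
```

A request at or below the ceiling passes through unchanged, so `clamp_request(250, 300)` reports `applied_w == 250` and `clamped == False`.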
02
Thermal Emergency Response
The second layer runs a high-frequency thermal sensor polling loop, completely independent of the main execution path. When any monitored sensor exceeds a configured threshold, this layer triggers an immediate emergency power reduction — bypassing governance gates and executing directly.

The response latency target is under 2ms from threshold breach to power reduction confirmation. This layer cannot be disabled by software policy — it is an always-on safety circuit.
<2ms response target · Always-on polling loop · Bypasses governance in emergency · Multi-sensor monitoring
THERMAL_MONITOR
> CURRENT: 78°C
> WARNING: 85°C
> EMERGENCY: 95°C
> STATE: ARMED
⚡ ARMED
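
One iteration of such a polling loop might look like the sketch below. It assumes the 85°C warning and 95°C emergency thresholds from the readout, treats "exceeds" as strictly greater than, and classifies by the hottest sensor. The function names and the `reduce_power` callback are illustrative assumptions, not the SPARK-XC API.

```python
from enum import Enum
from typing import Callable, Sequence


class ThermalState(Enum):
    NORMAL = "normal"
    WARNING = "warning"
    EMERGENCY = "emergency"


WARNING_C = 85.0    # thresholds taken from the readout above
EMERGENCY_C = 95.0


def classify(readings_c: Sequence[float]) -> ThermalState:
    """Classify by the hottest sensor; any single sensor can trip the state."""
    if not readings_c:
        raise ValueError("no sensor readings available")
    hottest = max(readings_c)
    if hottest > EMERGENCY_C:
        return ThermalState.EMERGENCY
    if hottest > WARNING_C:
        return ThermalState.WARNING
    return ThermalState.NORMAL


def poll_once(read_sensors: Callable[[], Sequence[float]],
              reduce_power: Callable[[], None]) -> ThermalState:
    """One polling iteration: act directly on emergency, with no policy check."""
    state = classify(read_sensors())
    if state is ThermalState.EMERGENCY:
        reduce_power()
    return state


print(poll_once(lambda: [78.0, 80.5], lambda: None))  # ThermalState.NORMAL
```

Note that `poll_once` calls `reduce_power` without consulting any ruleset, which mirrors the bypass of the governance gate described above.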
03
Governance Gates
All power limit requests — whether from a user application, an operator, or an automated scheduler — must pass through the governance gate before execution. The gate evaluates the request against a configurable ruleset.

Rules can encode time-based constraints, fleet-wide policies, per-workload budgets, or operator-specific limits. Requests that fail the gate are rejected and logged. The gate itself is audited — every evaluation produces a record.
Configurable ruleset · Per-workload policies · Reject / modify / approve · Every evaluation logged
POLICY_ENGINE
> RULES_LOADED: 48
> REQUEST: 300W
> MATCHED: rule_17
> RESULT: PASS
✓ GATE PASS
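
A gate of this kind could be sketched as follows. The sketch assumes first-match-wins rule ordering and a default-deny outcome when no rule matches; the `Request`, `Rule`, and `Decision` types and the example rule are hypothetical. Every evaluation, pass or fail, appends a record.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    requester: str
    workload: str
    watts: int


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Request], bool]  # predicate over the request
    allow: bool                         # outcome when the predicate matches


@dataclass(frozen=True)
class Decision:
    allowed: bool
    rule: Optional[str]  # None means no rule matched (default-deny)


def evaluate(request: Request, rules: List[Rule],
             audit: List[Tuple[Request, Decision]]) -> Decision:
    """Return the first matching rule's outcome, denying when nothing matches."""
    decision = Decision(False, None)
    for rule in rules:
        if rule.matches(request):
            decision = Decision(rule.allow, rule.name)
            break
    audit.append((request, decision))  # every evaluation produces a record
    return decision


rules = [Rule("training_budget",
              lambda r: r.workload == "training" and r.watts <= 300, True)]
log: List[Tuple[Request, Decision]] = []
print(evaluate(Request("scheduler", "training", 300), rules, log).allowed)  # True
print(evaluate(Request("scheduler", "training", 320), rules, log).allowed)  # False
```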
04
Execute + Verify
Once a request has passed the governance gate, Layer 4 executes the change and immediately reads back the hardware register to verify the intended state was applied. A delta of zero is required — if the readback does not match the request, the discrepancy is flagged, logged, and an alert is triggered.

This layer catches silent hardware failures, register corruption, and any tampering that occurred between the governance gate and hardware execution.
Immediate readback verification · Delta = 0W required · Discrepancy alerts · Catches silent hw failures
EXEC_VERIFY
> ACTION: SET_300W
> HW_READBACK: 300W
> DELTA: 0W
> RESULT: VERIFIED
✓ VERIFIED Δ=0W
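
The write-then-readback check can be illustrated with a simulated register. `FakeRegister` is a stand-in invented for this sketch (a real implementation would go through the vendor's power management interface), and its `stuck` flag imitates a write that silently fails.

```python
class FakeRegister:
    """Hypothetical stand-in for a hardware power limit register."""

    def __init__(self, value_w: int, stuck: bool = False):
        self.value_w = value_w
        self.stuck = stuck  # when True, writes are silently ignored

    def write(self, watts: int) -> None:
        if not self.stuck:
            self.value_w = watts

    def read(self) -> int:
        return self.value_w


def execute_and_verify(register: FakeRegister, target_w: int) -> dict:
    """Write the target, read it back, and require a delta of exactly zero."""
    register.write(target_w)
    readback_w = register.read()
    delta_w = readback_w - target_w
    return {
        "target_w": target_w,
        "readback_w": readback_w,
        "delta_w": delta_w,
        "verified": delta_w == 0,
    }


print(execute_and_verify(FakeRegister(250), 300)["verified"])              # True
print(execute_and_verify(FakeRegister(250, stuck=True), 300)["delta_w"])   # -50
```

The second call shows the silent-failure case: the write reports no error, yet the readback still holds 250W, so the delta is nonzero and the result is not verified.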
05
Cryptographic Audit Logging
Every event in the SPARK-XC pipeline — including actions, governance decisions, thermal events, verification results, and layer faults — is written to the cryptographic audit log. Each entry is signed with HMAC-SHA256 and chained to its predecessor.

The chain means that any insertion, deletion, or modification of a historical entry is immediately detectable. The log is the permanent, tamper-evident record of every power management decision ever made by the system.
HMAC-SHA256 signing · Forward-chained entries · Tamper-evident by design · Logs even layer failures
AUDIT_CHAIN
> SEQ: 14820
> PREV: 2e57...419f
> HMAC: f33c...8b02
> CHAIN: VALID
🔗 CHAINED
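
A minimal sketch of an HMAC-SHA256 chained log, using only Python's standard `hmac`, `hashlib`, and `json` modules, is shown below. The entry layout (`seq`, `prev`, `event`, `hmac`), the all-zero genesis value, and the demo key are assumptions for illustration; the actual SPARK-XC record format is not specified here.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder "previous" value for the first entry


def _mac(key: bytes, seq: int, prev: str, event: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the entry contents."""
    payload = json.dumps({"seq": seq, "prev": prev, "event": event},
                         sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def append_entry(chain: list, key: bytes, event: dict) -> dict:
    """Append an entry whose MAC covers the previous entry's MAC."""
    prev = chain[-1]["hmac"] if chain else GENESIS
    seq = len(chain)
    entry = {"seq": seq, "prev": prev, "event": event,
             "hmac": _mac(key, seq, prev, event)}
    chain.append(entry)
    return entry


def verify_chain(chain: list, key: bytes) -> bool:
    """Recompute every MAC and check each link back to its predecessor."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        expected = _mac(key, entry["seq"], entry["prev"], entry["event"])
        if (entry["seq"] != index or entry["prev"] != prev
                or not hmac.compare_digest(entry["hmac"], expected)):
            return False
        prev = entry["hmac"]
    return True


key = b"demo-key"  # illustrative only; a real key would come from a secret store
chain: list = []
append_entry(chain, key, {"action": "SET_300W", "result": "VERIFIED"})
append_entry(chain, key, {"action": "GATE", "result": "PASS"})
print(verify_chain(chain, key))  # True
```

Editing or removing an interior entry makes `verify_chain` return `False`. One limitation of this particular sketch: dropping entries from the tail leaves a shorter but still valid chain, so detecting truncation would need an external anchor, such as the latest sequence number stored separately.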

What happens when a layer fails

SPARK-XC is designed for these scenarios. Here is how the pipeline responds to representative fault conditions.

Scenario A
Driver crash mid-operation
The CUDA driver or kernel driver crashes while processing a power limit request. Layer 1 has already enforced the hardware ceiling at boot — the GPU hardware cannot exceed the clamped value regardless of driver state.
L1 hardware clamp remains active. GPU protected.
Scenario B
Thermal spike during governance evaluation
A GPU temperature spike occurs while Layer 3 is evaluating a governance rule. Layer 2 operates on an independent polling loop and does not wait for Layer 3. It immediately triggers emergency throttling.
L2 bypasses L3 and enforces emergency limit. Event logged.
Scenario C
Silent hardware register failure
Layer 4 attempts to set a power limit and the command appears to succeed — but a hardware fault causes the register to retain its previous value. Layer 4 reads back the register and detects the delta mismatch.
L4 flags discrepancy. Alert triggered. Event logged in L5.
Scenario D
Policy engine configuration error
A misconfigured governance rule rejects all requests, causing Layer 3 to block operations. Layer 1 is still enforcing the hardware ceiling. The system is safe — just not accepting new limits. Layer 5 records all rejected requests.
L1 maintains safety floor. L5 provides full diagnostic trail.
Scenario E
Audit log tampering attempt
An adversary attempts to delete a log entry from the audit chain. The deletion breaks the HMAC chain: the following entry's PREV value no longer matches the HMAC of the entry that now precedes it. The next chain integrity check detects and flags the tampering.
Chain integrity violation detected. Forensic investigation enabled.
Scenario F
Multiple simultaneous layer failures
Layers 2, 3, and 4 fail simultaneously (e.g., due to a software bug affecting the SPARK-XC daemon). Layer 1 hardware clamping remains active — it is implemented below the software stack and requires no daemon to function.
L1 enforces ceiling independently. L5 logs all failures.

Pipeline timing profile

| Layer | Normal Latency | Emergency Latency | Failure Mode | Recovery |
| --- | --- | --- | --- | --- |
| L1 · HW Clamp (hardware register write) | <0.1ms (register write latency) | Always-on, no emergency path | HW fault (register unwritable) | Alert + L5 log (external intervention required) |
| L2 · Thermal (sensor polling loop) | <1ms (polling interval) | <2ms (detection to action) | Sensor failure (conservative throttle applied) | Auto-conservative (reduce limit until sensor restored) |
| L3 · Governance (policy evaluation) | <5ms (ruleset evaluation) | Bypassed (emergency path skips L3) | Config error (default-deny applied) | Admin config update (rules reloaded live) |
| L4 · Verify (execute + readback) | <2ms (write + read cycle) | <2ms (same path) | Delta mismatch (alert raised, retry or escalate) | Alert + retry (up to N retries before escalation) |
| L5 · Audit (log append + HMAC) | <1ms (hash + write) | <1ms (same path) | Storage full (oldest entries rotated, chain preserved) | Auto-rotate (chain integrity maintained) |
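
The L4 recovery behavior listed in the timing profile (retry up to N times, then escalate) could be sketched as a bounded loop. Here `max_retries` plays the role of N, and `FlakyRegister` is a hypothetical test double that ignores its first few writes; neither name comes from SPARK-XC.

```python
from typing import Callable


class FlakyRegister:
    """Hypothetical register that silently drops its first `drop` writes."""

    def __init__(self, value_w: int, drop: int):
        self.value_w = value_w
        self.drop = drop

    def write(self, watts: int) -> None:
        if self.drop > 0:
            self.drop -= 1  # this write is lost
        else:
            self.value_w = watts

    def read(self) -> int:
        return self.value_w


def apply_with_retry(write: Callable[[int], None], read: Callable[[], int],
                     target_w: int, max_retries: int = 3) -> dict:
    """One initial attempt plus up to `max_retries` retries, then escalate."""
    attempts = 0
    while attempts <= max_retries:
        attempts += 1
        write(target_w)
        if read() == target_w:  # readback delta of zero
            return {"status": "verified", "attempts": attempts}
    return {"status": "escalated", "attempts": attempts}


reg = FlakyRegister(250, drop=2)
print(apply_with_retry(reg.write, reg.read, 300))
# {'status': 'verified', 'attempts': 3}
```

With a register that never accepts the write, the loop stops after the initial attempt plus `max_retries` retries and returns `"escalated"`, at which point an alert path would take over.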

See how SPARK-XC fits your infrastructure

Explore real-world deployment scenarios across data centers, AI labs, and enterprise GPU fleets.
