Profiling GPU work
stax recordings are not limited to what the CPU sampler can see. A profiled app can put target/executor lanes on the same timeline as its threads: GPU command queues, accelerator work, runtime queues, or any other work the target can timestamp and name. GPU work is the first concrete consumer, but the mechanism is deliberately general — no export step, no second tool, no correlation pass.
This page is the GPU specialization of the generic target-span integration contract.
How it works
The app links the stax-target crate and does two things:
- Gate capture on stax_target::reporting_active(): one relaxed atomic load, safe on hot paths. A background worker (spawned on first use; threads named stax-target/stax-target-io) polls stax-server about once a second: "is a recording of my pid active?" Attach and detach propagate within one poll period. A missing server, a missing socket, or a server restart all degrade to "off" and recover on a later poll. The app pays its span-capture cost (e.g. GPU timestamp heaps) only while recorded.
- Report spans through a lane such as stax_target::Lane::metal("GPU tq1s"): fire and forget, bounded queue, drop-newest. Each TargetSpan is a name plus absolute mach_absolute_time-derived nanoseconds (Apple Silicon GPU timestamps share that timebase, which is why no correlation step exists anywhere). A target can also attach a TargetSpanOrigin captured with stax_target::current_span_origin() at dispatch/queue time; that gives stax the CPU tid and timestamp needed to borrow the nearest sampled CPU stack.
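The two behaviours above (a relaxed atomic gate, and a bounded queue that drops the newest span when full) can be modelled without the crate. This is a minimal, self-contained sketch of the pattern, not stax-target's code: REPORTING, Span, Reporter, and report are hypothetical stand-ins, and the std sync_channel is just one convenient way to get a bounded, non-blocking queue.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical stand-in for the flag a background poller would flip.
static REPORTING: AtomicBool = AtomicBool::new(false);

/// Hot-path gate: a single relaxed atomic load.
fn reporting_active() -> bool {
    REPORTING.load(Ordering::Relaxed)
}

/// Hypothetical span shape: a name plus absolute start/end nanoseconds.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    name: &'static str,
    start_ns: u64,
    end_ns: u64,
}

/// Bounded, fire-and-forget reporter that drops the newest span when full.
struct Reporter {
    tx: SyncSender<Span>,
    dropped: AtomicU64,
}

impl Reporter {
    fn new(capacity: usize) -> (Self, Receiver<Span>) {
        let (tx, rx) = sync_channel(capacity);
        (Reporter { tx, dropped: AtomicU64::new(0) }, rx)
    }

    /// Never blocks: if the queue is full, the incoming span is discarded.
    fn report(&self, span: Span) {
        if !reporting_active() {
            return; // not recorded: nothing paid beyond the atomic load
        }
        if let Err(TrySendError::Full(_)) = self.tx.try_send(span) {
            self.dropped.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let (reporter, rx) = Reporter::new(2);

    // Gate is off: this span is ignored entirely.
    reporter.report(Span { name: "ignored", start_ns: 0, end_ns: 1 });

    // Simulate "a recording of my pid became active".
    REPORTING.store(true, Ordering::Relaxed);
    for i in 0..3u64 {
        reporter.report(Span { name: "kernel", start_ns: i * 10, end_ns: i * 10 + 5 });
    }

    for span in rx.try_iter() {
        println!("{} {}..{} ns", span.name, span.start_ns, span.end_ns);
    }
    println!("dropped = {}", reporter.dropped.load(Ordering::Relaxed));
}
```

With a capacity of 2 and nothing draining the queue, the third span is the one discarded, which is what drop-newest means: already queued spans are never evicted to make room.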
Server-side (TargetIngest), each (pid, lane) becomes a synthetic
thread — a pseudo-tid at/above 0xFFF0_0000 — and each distinct span
name becomes a synthetic symbol. Each reported span records one sample
marker plus one attributed synthetic execution interval, so kernel/job names
render like function names in top, flame, and the web UI timeline. The
legacy active-time fields include these durations for compatibility, and the
newer reporting surfaces break them out explicitly as target time and
target span counts. Target lanes are parallel execution lanes. With origins,
top/flame for the dispatching CPU tid include the matching lane work and
the web target-span details link each span back to the sampled CPU stack that
queued it; the lane still renders as lane -> span unless a future
integration also reports the CPU wait/completion side for a wall-time view.
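The (pid, lane) to pseudo-tid mapping can be pictured with a small sketch. Only the 0xFFF0_0000 floor comes from the text above; LaneTable, tid_for, and the sequential numbering are hypothetical and say nothing about how TargetIngest actually allocates pseudo-tids.

```rust
use std::collections::HashMap;

/// Pseudo-tids for target lanes live at or above this value.
const SYNTHETIC_TID_BASE: u32 = 0xFFF0_0000;

/// True when a tid belongs to a synthetic lane rather than a real thread.
fn is_synthetic_tid(tid: u32) -> bool {
    tid >= SYNTHETIC_TID_BASE
}

/// Hypothetical table handing out one pseudo-tid per distinct (pid, lane).
/// Sequential numbering is an assumption made for illustration only.
#[derive(Default)]
struct LaneTable {
    tids: HashMap<(u32, String), u32>,
}

impl LaneTable {
    fn tid_for(&mut self, pid: u32, lane: &str) -> u32 {
        // The next free slot is only used if this (pid, lane) is new.
        let next = SYNTHETIC_TID_BASE + self.tids.len() as u32;
        *self.tids.entry((pid, lane.to_string())).or_insert(next)
    }
}

fn main() {
    let mut table = LaneTable::default();
    let gpu = table.tid_for(4242, "GPU tq1s");
    let again = table.tid_for(4242, "GPU tq1s");
    let other = table.tid_for(4242, "some other lane");
    println!("{gpu:#x} {again:#x} {other:#x}");
    println!("synthetic: {}", is_synthetic_tid(gpu));
}
```

The useful part for readers of a recording is the predicate: any tid at or above the base is a lane, not an OS thread, so its time should be read as target time rather than CPU time.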
A worked example: bee's hx
bee's Metal 4 runtime captures per-dispatch GPU timestamps and reports
them as the "GPU tq1s" Metal lane (bee/rust/helix-metal4/src/stax.rs):
    stax record -- ./target/release/hx run --cfg configs/production.jsonc …
    stax threads | grep -i gpu

In the verified 2026-06-12 hx run, this lane had 6300 ingested kernel
spans. For synthetic lanes, target ms is the exact sum of reported span
durations and spans is the span count. The old active-time field still
includes the same duration so flame widths and older clients keep working,
but it does not mean a CPU thread was busy.
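To make the two columns concrete, here is a minimal sketch of the arithmetic for a synthetic lane: target time is the plain sum of span durations per name, and spans is the count. The Span and Row types and the kernel names are hypothetical; this illustrates the definition, not the server's aggregation code.

```rust
use std::collections::BTreeMap;

/// Hypothetical reported span: a name plus absolute start/end nanoseconds.
struct Span {
    name: &'static str,
    start_ns: u64,
    end_ns: u64,
}

/// One row per span name: summed duration and span count.
#[derive(Debug, Default, PartialEq)]
struct Row {
    target_ns: u64,
    spans: u64,
}

fn aggregate(spans: &[Span]) -> BTreeMap<&'static str, Row> {
    let mut rows: BTreeMap<&'static str, Row> = BTreeMap::new();
    for s in spans {
        let row = rows.entry(s.name).or_default();
        // Exact sum of reported durations; no sampling or estimation involved.
        row.target_ns += s.end_ns.saturating_sub(s.start_ns);
        row.spans += 1;
    }
    rows
}

fn main() {
    // Made-up kernel names and timestamps, for illustration only.
    let spans = [
        Span { name: "matmul", start_ns: 0, end_ns: 2_000_000 },
        Span { name: "matmul", start_ns: 3_000_000, end_ns: 4_500_000 },
        Span { name: "rms_norm", start_ns: 5_000_000, end_ns: 5_250_000 },
    ];
    for (name, row) in aggregate(&spans) {
        let target_ms = row.target_ns as f64 / 1e6;
        println!("{name:<10} target ms = {target_ms:.3}  spans = {}", row.spans);
    }
}
```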
Reading the results
- stax threads: existence + span count. Synthetic tids live at/above 0xFFF0_0000; target lanes with spans are included even past the normal -n cutoff.
- Web UI timeline (ws://127.0.0.1:8080, see The Web UI): the lane drawn against the real threads, spans named per kernel.
- stax top --tid <synthetic> / stax flame --tid <synthetic>: per-kernel aggregation. top reports total span duration in target ms and span count in spans. flame renders (all) -> lane -> span name with target time/span columns.
- stax top --tid <cpu tid> / stax flame --tid <cpu tid>: when the target reports span origins, these thread-scoped views include the GPU work queued from that CPU thread as parallel lane work. Use the web target-span details to jump from a kernel/span back to the CPU dispatch stack.
- stax diagnose: target ingest counters: batches, recorded/dropped spans, no-active-run / wrong-pid drops, total target duration, lanes, and origin link/unlink counts. It also reports target-side stax-target queue drops, why origins did not link (bad_tid, no_thread, no_stack, too_far), and the min/avg/max PET sample distance for linked or too-far origins. Use this when spans are missing or CPU-stack attribution is missing.
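The sample-distance numbers in stax diagnose are easier to read with a picture of what linking an origin involves: find the CPU sample on the origin's thread whose timestamp is closest to the origin, and refuse the link if it is too far away. The sketch below is one plausible way to do that, not stax's implementation; Link, link_origin, and the max_distance_ns threshold are hypothetical, and NoSample only loosely corresponds to the reasons listed above.

```rust
/// Outcome of trying to attach an origin to a sampled CPU stack.
#[derive(Debug, PartialEq)]
enum Link {
    Linked { sample_index: usize, distance_ns: u64 },
    /// The thread has no samples at all (hypothetical reason).
    NoSample,
    /// Nearest sample exists but is beyond the allowed distance.
    TooFar { distance_ns: u64 },
}

/// `sample_ts` must be sorted ascending: timestamps of one thread's samples.
fn link_origin(sample_ts: &[u64], origin_ns: u64, max_distance_ns: u64) -> Link {
    // Index of the first sample at or after the origin timestamp.
    let i = sample_ts.partition_point(|&t| t < origin_ns);
    let before = i.checked_sub(1);
    let after = if i < sample_ts.len() { Some(i) } else { None };

    // Pick whichever neighbour is closer; ties go to the earlier sample.
    let (sample_index, distance_ns) = match (before, after) {
        (Some(b), Some(a)) => {
            let db = origin_ns - sample_ts[b];
            let da = sample_ts[a] - origin_ns;
            if db <= da { (b, db) } else { (a, da) }
        }
        (Some(b), None) => (b, origin_ns - sample_ts[b]),
        (None, Some(a)) => (a, sample_ts[a] - origin_ns),
        (None, None) => return Link::NoSample,
    };

    if distance_ns > max_distance_ns {
        Link::TooFar { distance_ns }
    } else {
        Link::Linked { sample_index, distance_ns }
    }
}

fn main() {
    let samples = [100, 200, 300];
    println!("{:?}", link_origin(&samples, 190, 30));
    println!("{:?}", link_origin(&samples, 260, 30));
    println!("{:?}", link_origin(&[], 260, 30));
}
```

Read this way, a high too_far count with a large average distance suggests the dispatching thread was sampled too sparsely around dispatch time, rather than that the target reported bad origins.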
Interpreting a GPU-bound target
Expect stax top to look almost empty — single-digit samples, allocator
noise. That IS the finding: the CPU is idle and the time lives in off-CPU
waits (stax threads) and the GPU lane. Do not conclude "stax is a CPU
profiler and can't help here"; the recording already contains the GPU
story.