Integrating Target Spans
Use stax-target when the interesting work is not directly visible to the
CPU sampler: GPU kernels, accelerator queues, async executors, worker pools,
media engines, or any runtime that can name and timestamp work.
The integration goal is a linked graph, not a fake single-thread stack:

    CPU queue/dispatch stack --origin--> target lane -> named span

No trace export, no second profiler, no timestamp correlation pass. The spans
land in the same threads, top, flame, timeline, and web UI views as CPU
samples and off-CPU intervals. Target lanes remain parallel execution lanes;
if a future integration also reports the CPU wait/completion side, stax can
offer a separate wall-time/critical-path view that nests target work under a
stack only when the same stack both dispatched and awaited that work.
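
To make the linked graph concrete, here is a minimal standalone sketch that does not use stax-target at all. `Origin`, `Span`, `Work`, and `run_demo` are hypothetical stand-ins invented for illustration; the only point is that the origin is captured on the thread that queues the work and travels with the work item to the lane that runs it.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for an origin token: which CPU thread queued the work.
#[derive(Debug, Clone, PartialEq)]
struct Origin {
    thread_name: String,
}

/// Hypothetical stand-in for a span reported on a named lane.
#[derive(Debug)]
struct Span {
    lane: &'static str,
    name: &'static str,
    origin: Origin,
}

/// A work item that carries its origin across the queue.
struct Work {
    name: &'static str,
    origin: Origin,
}

fn run_demo() -> Vec<Span> {
    let (tx, rx) = mpsc::channel::<Work>();

    // The worker plays the role of the target lane: it runs work and emits spans.
    let worker = thread::Builder::new()
        .name("worker".to_string())
        .spawn(move || {
            rx.into_iter()
                .map(|work| Span {
                    lane: "decoder worker",
                    name: work.name,
                    origin: work.origin,
                })
                .collect::<Vec<_>>()
        })
        .expect("failed to spawn worker");

    // The dispatcher captures the origin at the queue site, on its own thread.
    let dispatcher = thread::Builder::new()
        .name("dispatcher".to_string())
        .spawn(move || {
            let origin = Origin {
                thread_name: thread::current().name().unwrap_or("unnamed").to_string(),
            };
            tx.send(Work { name: "decode chunk", origin })
                .expect("worker hung up");
            // `tx` is dropped here, which closes the channel and ends the worker loop.
        })
        .expect("failed to spawn dispatcher");

    dispatcher.join().expect("dispatcher panicked");
    worker.join().expect("worker panicked")
}
```

The span ends up on the worker's lane but still names the dispatcher as its origin, which is the `--origin-->` edge in the diagram above.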
The pattern
Add the crate:

    [dependencies]
    stax-target = { path = "../stax/stax-target" }

Create one lane per logical executor:

    let lane = stax_target::Lane::new("decoder worker");
    let _gpu_lane = stax_target::Lane::metal("GPU tq1s");

Gate capture before paying instrumentation costs. When no matching stax recording is active, this is just one relaxed atomic load:

    if lane.reporting_active() {
        let origin = lane.current_origin();
        enqueue_work(origin);
    }

For executor-style code, prefer carrying a CapturedOrigin token. It remembers
both "capture was active" and the optional OS-thread origin. If the platform
cannot capture an origin, lane-only views still work.
Lane::begin_span, Lane::begin_span_with_origin, and
Lane::begin_span_with_captured_origin perform the active-recording gate, so
worker-side timing can use if let Some(open) = ... directly.
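
The gate-then-`Option` shape can be sketched without the crate. Everything below (`DemoLane`, `OpenSpan`, `set_recording`, `timed_work`) is a hypothetical stand-in, not the stax-target implementation; it only illustrates why a relaxed atomic load is cheap enough to leave in hot paths and how a `begin_span`-style call can return `None` when nothing is recording.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Instant;

/// Hypothetical stand-in for a lane; not the stax-target type.
struct DemoLane {
    recording: AtomicBool,
}

/// An open span that only exists if recording was active when it began.
struct OpenSpan {
    name: &'static str,
    started: Instant,
}

impl DemoLane {
    fn new() -> Self {
        Self { recording: AtomicBool::new(false) }
    }

    /// In a real integration the profiler would flip this; here we do it by hand.
    fn set_recording(&self, on: bool) {
        self.recording.store(on, Ordering::Relaxed);
    }

    /// The gate: a single relaxed load, no locks and no allocation.
    fn reporting_active(&self) -> bool {
        self.recording.load(Ordering::Relaxed)
    }

    /// Performs the gate itself, so callers can write `if let Some(open) = ...`.
    fn begin_span(&self, name: &'static str) -> Option<OpenSpan> {
        if !self.reporting_active() {
            return None;
        }
        Some(OpenSpan { name, started: Instant::now() })
    }
}

impl OpenSpan {
    /// Closes the span and returns (name, elapsed nanoseconds).
    fn finish(self) -> (&'static str, u128) {
        (self.name, self.started.elapsed().as_nanos())
    }
}

/// Worker-side timing: the work always runs, the span is optional.
fn timed_work(lane: &DemoLane) -> Option<&'static str> {
    let open = lane.begin_span("decode chunk");
    // ... the real work would run here ...
    open.map(|span| span.finish().0)
}
```

When the flag is off, `timed_work` pays for one atomic load and never reads the clock.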
Capture the origin at the queue/dispatch site, then time/report the work where it actually runs:

    struct Work {
        origin: stax_target::CapturedOrigin,
    }

    fn enqueue(lane: &stax_target::Lane) {
        submit(Work {
            origin: lane.capture_origin(),
        });
    }

    fn run_worker(lane: &stax_target::Lane, work: Work) {
        let open = lane.begin_span_with_captured_origin("decode chunk", work.origin);
        decode_chunk();
        if let Some(open) = open {
            open.finish_and_report(lane);
        }
    }

That is the general executor form. For APIs that already give exact start/end timestamps, build spans directly:

    let origin = lane.capture_origin();
    if let Some(span) = lane.span_with_captured_origin("kernel_name", start_ns, end_ns, origin) {
        lane.report_one(span);
    }

Use stax_target::now_ns() when your target-side timestamps should come from
the same host clock stax expects. Use SpanBuilder when an integration wants
to validate or attach origins before deciding where to report a span.
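
When the target reports timestamps in its own tick domain rather than host nanoseconds, they need converting before a span is built. The sketch below assumes a simple linear mapping from one calibration pair; `ClockCalibration` and its fields are hypothetical names, and a real integration may need periodic re-calibration or a platform-specific conversion instead.

```rust
/// Hypothetical calibration: one (host_ns, target_ticks) pair sampled at the
/// same moment, plus the target clock's tick rate. Not a stax-target type.
#[derive(Debug, Clone, Copy)]
struct ClockCalibration {
    host_ns_at_sync: u64,
    target_ticks_at_sync: u64,
    /// Assumed to be non-zero.
    ticks_per_second: u64,
}

impl ClockCalibration {
    /// Maps a target tick value onto the host nanosecond clock.
    fn to_host_ns(&self, target_ticks: u64) -> u64 {
        // Use a signed delta so ticks taken before the sync point convert too.
        let delta_ticks = target_ticks as i128 - self.target_ticks_at_sync as i128;
        let delta_ns = delta_ticks * 1_000_000_000 / self.ticks_per_second as i128;
        (self.host_ns_at_sync as i128 + delta_ns).max(0) as u64
    }

    /// Converts a start/end pair, rejecting spans that would end before they start.
    fn span_ns(&self, start_ticks: u64, end_ticks: u64) -> Option<(u64, u64)> {
        if end_ticks < start_ticks {
            return None;
        }
        Some((self.to_host_ns(start_ticks), self.to_host_ns(end_ticks)))
    }
}
```

With a 24 MHz tick rate, 24 ticks correspond to 1000 ns, so a span from tick 524 to tick 548 against a sync point of tick 500 at host time 1,000,000 ns maps to 1,001,000 ns through 1,002,000 ns.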
For richer integrations, use Lane::dispatch_builder(name) and
TargetRecordBatch. The simple span API keeps working, but the richer path can
attach stable dispatch/shader/source ids, wait/completion origins, attachment
ids, and counter sample ids:

    let dispatch_id = stax_target::TargetDispatchId::new(42);
    let shader_id = stax_target::TargetShaderId::new(7);
    let origin = lane.capture_origin();
    let span = lane
        .dispatch_builder("tq6_1s_rows")
        .with_captured_origin(origin)
        .with_dispatch_id(dispatch_id)
        .with_shader_id(shader_id)
        .timestamps(start_ns, end_ns)
        .build();
    if let Some(span) = span {
        lane.report_batch(vec![span], stax_target::TargetRecordBatch::default());
    }

Metadata-only batches are allowed. For example, a target can register source or
counter definitions before dispatches resolve; stax diagnose will show that
the records are arriving even if no spans have landed yet.
For integration health logs, admin endpoints, or assertions in your own test
fixtures, read lane.reporter_stats() or stax_target::reporter_stats().
That snapshot is passive: it reports whether the background worker has been
armed, the last capture-gate state, whether the worker currently has a
stax-server connection, and local drop counters, but it does not start polling
by itself. Use lane.reporting_active() as the real capture gate.
Recipes by integration style
Thread pools
Capture at submission, carry the token in the work item, and open the span in the worker:

    struct Job {
        origin: stax_target::CapturedOrigin,
    }

    fn submit(pool: &Pool, lane: &stax_target::Lane) {
        pool.push(Job {
            origin: lane.capture_origin(),
        });
    }

    fn worker(lane: &stax_target::Lane, job: Job) {
        let open = lane.begin_span_with_captured_origin("run job", job.origin);
        run_job();
        if let Some(open) = open {
            open.finish_and_report(lane);
        }
    }

Async executors
Capture before the work crosses the async boundary. Open the span only when a task or worker actually starts doing the work:

    async fn schedule(tx: &Queue, lane: &stax_target::Lane) {
        tx.send(Job {
            origin: lane.capture_origin(),
        })
        .await;
    }

    async fn run_worker(lane: &stax_target::Lane, job: Job) {
        let open = lane.begin_span_with_captured_origin("poll work item", job.origin);
        poll_work_item().await;
        if let Some(open) = open {
            open.finish_and_report(lane);
        }
    }

Exact timestamp APIs
When an API gives exact start/end timestamps, capture the origin at dispatch and report after completion:

    fn dispatch(lane: &stax_target::Lane) {
        let origin = lane.capture_origin();
        let completion = submit_to_runtime();
        if let Some(span) = lane.span_with_captured_origin(
            completion.name,
            completion.start_ns,
            completion.end_ns,
            origin,
        ) {
            lane.report_one(span);
        }
    }

GPU timestamp counters
For Metal 4 or similar APIs, the important boundary is the timestamp conversion. Convert the target timestamps into the same nanosecond clock domain stax expects, then report ordinary target spans:

    fn encode_kernel(lane: &stax_target::Lane, command: &mut Command) {
        let origin = lane.capture_origin();
        command.encode_dispatch();
        command.on_complete(move |timestamps| {
            let start_ns = gpu_timestamp_to_stax_ns(timestamps.start);
            let end_ns = gpu_timestamp_to_stax_ns(timestamps.end);
            if let Some(span) = lane.span_with_captured_origin(
                timestamps.kernel_name,
                start_ns,
                end_ns,
                origin,
            ) {
                lane.report_one(span);
            }
        });
    }

Bad-origin debugging
If spans arrive but do not link back to CPU dispatch origins, run:

    stax diagnose

The usual fixes are:
- capture at queue/dispatch time, not completion time
- capture on the OS thread that queued the work
- keep span timestamps in one monotonic nanosecond clock
- keep span names semantic and low-cardinality
stax diagnose names the failure mode:
- bad_tid means the origin tid was itself a synthetic target lane; capture on a real CPU thread before reporting target work.
- no_thread means no PET samples were recorded for that tid in this run; check that the profiled process/thread is really the one dispatching work.
- no_stack means the tid was sampled, but without a user stack; check frame pointers / DWARF unwinding or whether the dispatch site only sampled in kernel/runtime glue.
- too_far means the nearest sampled CPU stack was outside stax's origin window; move origin capture closer to the queue/submit point.
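
If you triage these reasons in your own tooling or test fixtures, the four labels map cleanly onto an enum. The sketch below is a hypothetical helper, not part of stax: `UnlinkedReason`, `parse`, `suggested_fix`, and `dominant` are invented names, and the input counts are assumed to have been copied by hand from a diagnose report rather than read through any stax API.

```rust
/// The four unlinked-origin reasons named by `stax diagnose`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum UnlinkedReason {
    BadTid,
    NoThread,
    NoStack,
    TooFar,
}

impl UnlinkedReason {
    /// Parses the label as it appears in the diagnose output.
    fn parse(label: &str) -> Option<Self> {
        match label {
            "bad_tid" => Some(Self::BadTid),
            "no_thread" => Some(Self::NoThread),
            "no_stack" => Some(Self::NoStack),
            "too_far" => Some(Self::TooFar),
            _ => None,
        }
    }

    /// A one-line reminder of the fix suggested for each reason.
    fn suggested_fix(self) -> &'static str {
        match self {
            Self::BadTid => "capture on a real CPU thread before reporting target work",
            Self::NoThread => "check that the profiled process/thread is the one dispatching work",
            Self::NoStack => "check frame pointers / DWARF unwinding at the dispatch site",
            Self::TooFar => "move origin capture closer to the queue/submit point",
        }
    }
}

/// Picks the reason with the highest non-zero count, ignoring unknown labels.
fn dominant(counts: &[(&str, u64)]) -> Option<UnlinkedReason> {
    counts
        .iter()
        .filter(|(label, n)| *n > 0 && UnlinkedReason::parse(label).is_some())
        .max_by_key(|(_, n)| *n)
        .and_then(|(label, _)| UnlinkedReason::parse(label))
}
```

Starting with the dominant reason keeps the investigation focused on the fix most likely to restore linking.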
Demo workload
The repo includes several target-span demos:
| example | what it demonstrates |
|---|---|
| executor | minimal queue/worker split with CapturedOrigin |
| thread_pool | multiple workers sharing one logical target lane |
| async_executor | scheduling work into an async channel, then timing it in the async worker |
| codec | exact host timestamps with decode/encode lanes |
| model_runtime | semantic model-runtime lanes with SpanBuilder |
| gpu_timestamps | Metal-style timestamp-counter conversion without SDK dependencies |
| bad_origins | intentionally missing/stale/wrong-thread origins for stax diagnose |
| corpus | blessed CPU/off-CPU/target-span/origin diagnostics workload |
For example:

    just demo-corpus
    stax top --tid <corpus-executor-tid> --sort self
    stax target top --by time
    stax target top --by count
    stax top --tid <cpu-tid> --sort total
    stax flame --tid <cpu-tid> --threshold-pct 0

just demo-corpus records stax-target/examples/corpus.rs, then prints
stax threads -n 0 and stax diagnose. The thread table will show synthetic
lanes such as corpus executor, corpus gpu, and corpus bad origins. A
synthetic lane's target ms is the exact duration of reported work and
spans is the span count. When origins link, filtering flame/top to the CPU
thread that queued work limits the target lane tree to that origin's work; the
web target-span details show the linked CPU dispatch stack.
Diagnostics
stax diagnose reports target ingest health:
- batches and spans received/recorded
- batches/spans dropped because no run was active or the pid did not match
- batches/spans dropped in stax-target before reaching the server because the local queue filled or the background worker disconnected
- dropped spans with invalid durations
- per-lane duration and span totals
- origin-linked and origin-unlinked counts
- unlinked-origin reasons: bad_tid, no_thread, no_stack, too_far
- linked and too-far origin PET distance min/avg/max
If spans show up on the synthetic lane but do not link to CPU dispatch
origins, check the origin counters first. Unlinked origins usually mean the
target captured the origin too far from the queue point, used the wrong
thread, or the CPU sampler did not catch a nearby PET sample. If too_far
dominates, the average/max distance tells you how stale the origin was
relative to the nearest CPU stack.
Inside the cooperating process, reporter_stats() answers a different
question: whether the local reporter worker is armed/connected and whether it
has already dropped batches before the server could see them. That is the
right signal for target-side logs such as "we are being recorded but the local
queue is overflowing".
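
A target-side health check built on that snapshot could look like the sketch below. The struct and field names (`StatsSnapshot`, `worker_armed`, `capture_active`, `connected`, and the two drop counters) are assumptions made for illustration; the real `reporter_stats()` return type may be shaped differently, so treat this as the decision logic only.

```rust
/// Hypothetical mirror of the reporter snapshot described above.
/// Field names are assumptions, not the stax-target API.
#[derive(Debug, Clone, Copy, Default)]
struct StatsSnapshot {
    worker_armed: bool,
    capture_active: bool,
    connected: bool,
    dropped_queue_full: u64,
    dropped_disconnected: u64,
}

/// What a health log line or admin endpoint might report.
#[derive(Debug, PartialEq, Eq)]
enum Health {
    /// No recording is asking for spans; nothing to worry about.
    Idle,
    /// Recording is active and the worker is connected with no local drops.
    Healthy,
    /// Recording is active but the worker has no server connection.
    NotConnected,
    /// Batches were dropped locally before the server could see them.
    Dropping { total: u64 },
}

fn classify(stats: &StatsSnapshot) -> Health {
    // Local drops are reported first because the server cannot see them.
    let dropped = stats.dropped_queue_full + stats.dropped_disconnected;
    if dropped > 0 {
        return Health::Dropping { total: dropped };
    }
    if !stats.worker_armed || !stats.capture_active {
        return Health::Idle;
    }
    if !stats.connected {
        return Health::NotConnected;
    }
    Health::Healthy
}
```

Because the snapshot is passive, a check like this only reads state; it should not be used in place of the capture gate.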
Specializations
- Profiling GPU Work shows the same target span contract applied to Metal 4 timestamp-counter kernels.
- JIT code naming is a different contract: see Profiling JIT Code.