Profiling JIT Code

By default a sampling profiler sees JIT'd code as <unresolved> (no binary mapped at 0x…): the machine code lives in an anonymous mmap, not in any Mach-O or ELF image with a symbol table. stax already has the machinery to fix this; it only needs the JIT runtime to cooperate by emitting a perf jitdump file.

This page is the contract. Any JIT that follows it (Cranelift, a custom backend, V8, the JVM) will have its functions show up by name in stax top, stax flame, and stax annotate.

How stax consumes a jitdump

  • The runtime writes a growing stream of records to /tmp/jit-<pid>.dump, where <pid> is the profiled process's PID.
  • stax's preload library notices the target calling open() on that path and tells the recorder about it.
  • A tailer opens the file, parses the 40-byte global header, and on every tick reads newly appended JIT_CODE_LOAD records, emitting a synthetic binary-load event per compiled function.

From then on, that address range resolves to the name you gave it. Because each record carries the code bytes, stax annotate can disassemble JIT'd functions too — no task_for_pid / memory read needed.

So the whole job is: emit the file, emit one JIT_CODE_LOAD per compiled function, and keep appending.

The file format

Reference: perf's own jitdump-specification.txt. Integers are written in the producer's native byte order, which is little-endian on aarch64 and x86_64; stax accepts the magic in either byte order and infers the file's endianness from it.
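Inferring the byte order takes only a comparison of the magic in both directions. The sketch below illustrates the idea and is not stax's code; `ByteOrder` and `detect_byte_order` are hypothetical names.

```rust
const JITDUMP_MAGIC: u32 = 0x4A69_5444; // "JiTD"

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ByteOrder {
    Little,
    Big,
}

/// Infers the producer's byte order from the first four bytes of the file.
/// Returns None for short input or a magic that matches neither way.
pub fn detect_byte_order(file_start: &[u8]) -> Option<ByteOrder> {
    let b = file_start.get(..4)?;
    let bytes = [b[0], b[1], b[2], b[3]];
    if u32::from_le_bytes(bytes) == JITDUMP_MAGIC {
        Some(ByteOrder::Little)
    } else if u32::from_be_bytes(bytes) == JITDUMP_MAGIC {
        Some(ByteOrder::Big)
    } else {
        None
    }
}
```

On a little-endian producer the file therefore starts with the bytes `DTiJ`; a big-endian producer would write `JiTD`.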

Global header — 40 bytes, written once

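| field      | type | value                                                                  |
|------------|------|------------------------------------------------------------------------|
| magic      | u32  | 0x4A695444 ("JiTD"), host-endian                                       |
| version    | u32  | 1                                                                      |
| total_size | u32  | 40 (size of this header)                                               |
| elf_mach   | u32  | host ELF machine, e.g. EM_AARCH64 = 183, EM_X86_64 = 62 (stax ignores it) |
| pad1       | u32  | 0                                                                      |
| pid        | u32  | getpid()                                                               |
| timestamp  | u64  | any monotonic value                                                    |
| flags      | u64  | 0                                                                      |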
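The table maps directly onto a fixed 40-byte array. A minimal sketch in host byte order (`to_ne_bytes`); `global_header` is a hypothetical helper, and it hard-codes EM_AARCH64 because stax ignores `elf_mach`.

```rust
const JITDUMP_MAGIC: u32 = 0x4A69_5444; // "JiTD"
const EM_AARCH64: u32 = 183;

/// Builds the 40-byte jitdump global header in host byte order.
pub fn global_header(pid: u32, timestamp: u64) -> [u8; 40] {
    let mut h = [0u8; 40];
    h[0..4].copy_from_slice(&JITDUMP_MAGIC.to_ne_bytes()); // magic
    h[4..8].copy_from_slice(&1u32.to_ne_bytes());          // version
    h[8..12].copy_from_slice(&40u32.to_ne_bytes());        // total_size
    h[12..16].copy_from_slice(&EM_AARCH64.to_ne_bytes());  // elf_mach (ignored by stax)
    // h[16..20] is pad1 and stays zero.
    h[20..24].copy_from_slice(&pid.to_ne_bytes());         // pid
    h[24..32].copy_from_slice(&timestamp.to_ne_bytes());   // timestamp
    // h[32..40] is flags and stays zero.
    h
}
```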

Record prefix — 16 bytes, every record

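| field      | type | value                                  |
|------------|------|----------------------------------------|
| id         | u32  | 0 = JIT_CODE_LOAD                      |
| total_size | u32  | prefix + body, including name and code |
| timestamp  | u64  | monotonic; ordering only               |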
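Because every record starts with this prefix, a reader can skip any record it does not understand by advancing `total_size` bytes. A sketch of decoding the prefix, assuming a little-endian file (the aarch64 / x86_64 case); `RecordPrefix` and `parse_prefix` are hypothetical names.

```rust
/// The 16-byte prefix that starts every jitdump record.
#[derive(Debug, PartialEq, Eq)]
pub struct RecordPrefix {
    pub id: u32,
    pub total_size: u32,
    pub timestamp: u64,
}

/// Decodes a little-endian record prefix from the start of `buf`.
/// Returns None if fewer than 16 bytes are available yet.
pub fn parse_prefix(buf: &[u8]) -> Option<RecordPrefix> {
    let p = buf.get(..16)?;
    let u32_at = |i: usize| u32::from_le_bytes([p[i], p[i + 1], p[i + 2], p[i + 3]]);
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&p[8..16]);
    Some(RecordPrefix {
        id: u32_at(0),
        total_size: u32_at(4),
        timestamp: u64::from_le_bytes(ts),
    })
}
```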

JIT_CODE_LOAD body

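| field       | type           | value                               |
|-------------|----------------|-------------------------------------|
| pid         | u32            | getpid()                            |
| tid         | u32            | thread id (0 is fine)               |
| vma         | u64            | load address; stax keys on this     |
| code_addr   | u64            | same as vma for a JIT               |
| code_size   | u64            | length of the machine code          |
| code_index  | u64            | per-process incrementing counter    |
| name        | char[]         | NUL-terminated, free-form UTF-8     |
| native_code | u8[code_size]  | the actual bytes                    |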
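Decoding the body works left to right: 40 bytes of fixed fields, then the NUL-terminated name, then `code_size` bytes of code. A hypothetical `parse_code_load`, again assuming little-endian input; it takes the bytes that follow the 16-byte prefix.

```rust
/// A decoded JIT_CODE_LOAD body, borrowing name and code from the input.
#[derive(Debug, PartialEq, Eq)]
pub struct CodeLoad<'a> {
    pub pid: u32,
    pub tid: u32,
    pub vma: u64,
    pub code_addr: u64,
    pub code_index: u64,
    pub name: &'a str,
    pub code: &'a [u8],
}

fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&b[..4]);
    u32::from_le_bytes(a)
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

/// Parses a JIT_CODE_LOAD body. Returns None on truncation, a missing
/// NUL terminator, or a name that is not valid UTF-8.
pub fn parse_code_load(body: &[u8]) -> Option<CodeLoad<'_>> {
    let fixed = body.get(..40)?; // pid, tid, vma, code_addr, code_size, code_index
    let code_size = le_u64(&fixed[24..32]) as usize;
    let rest = &body[40..];
    let nul = rest.iter().position(|&b| b == 0)?; // end of the name
    let name = std::str::from_utf8(&rest[..nul]).ok()?;
    let end = (nul + 1).checked_add(code_size)?;
    let code = rest.get(nul + 1..end)?; // native_code follows the NUL
    Some(CodeLoad {
        pid: le_u32(&fixed[0..4]),
        tid: le_u32(&fixed[4..8]),
        vma: le_u64(&fixed[8..16]),
        code_addr: le_u64(&fixed[16..24]),
        code_index: le_u64(&fixed[32..40]),
        name,
        code,
    })
}
```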

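total_size = 16 + 40 + name.len() + 1 + code_size: the 16-byte prefix, the 40 bytes of fixed body fields (pid through code_index), the name, its NUL terminator, and the code.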
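As a worked example of the size formula, here is a hypothetical helper. Note that `name.len()` counts UTF-8 bytes, not characters.

```rust
const PREFIX_LEN: usize = 16; // id, total_size, timestamp
const CODE_LOAD_FIXED: usize = 40; // pid, tid, vma, code_addr, code_size, code_index

/// total_size of one JIT_CODE_LOAD record: prefix + fixed body + name + NUL + code.
pub fn code_load_total_size(name: &str, code_size: usize) -> usize {
    PREFIX_LEN + CODE_LOAD_FIXED + name.len() + 1 + code_size
}
```

For `my_jit::func` (12 bytes) and 64 bytes of code, that is 16 + 40 + 12 + 1 + 64 = 133.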

stax surfaces only JIT_CODE_LOAD (id = 0) today and silently skips records with any other id, so you do not need CODE_CLOSE, unwind, or debug-info records.
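Skipping by `total_size`, consuming only whole records, and resuming at the record boundary can be sketched as one loop over an in-memory buffer. This is an illustration under the little-endian assumption, not stax's actual tailer; `scan_records` is a hypothetical name, and a real tailer would re-read the file between calls.

```rust
/// Walks complete records in `buf` starting at `offset` (just past the
/// 40-byte global header on the first call). Returns (vma, name) for each
/// JIT_CODE_LOAD record and the offset where the next scan should resume.
pub fn scan_records(buf: &[u8], mut offset: usize) -> (Vec<(u64, String)>, usize) {
    let mut loads = Vec::new();
    while let Some(prefix) = buf.get(offset..offset + 16) {
        let id = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]);
        let total = u32::from_le_bytes([prefix[4], prefix[5], prefix[6], prefix[7]]) as usize;
        if total < 16 {
            break; // corrupt total_size: stop instead of spinning on this record
        }
        let Some(record) = buf.get(offset..offset + total) else {
            break; // partial trailing record: resume here once more bytes arrive
        };
        if id == 0 && record.len() >= 16 + 40 {
            let body = &record[16..];
            let mut vma = [0u8; 8];
            vma.copy_from_slice(&body[8..16]); // vma follows pid and tid
            let name = &body[40..]; // name starts after the fixed fields
            let end = name.iter().position(|&b| b == 0).unwrap_or(name.len());
            let name = String::from_utf8_lossy(&name[..end]).into_owned();
            loads.push((u64::from_le_bytes(vma), name));
        }
        // Every record, known id or not, is skipped by its total_size.
        offset += total;
    }
    (loads, offset)
}
```

Returning the resume offset lets the caller append newly read bytes and call again without re-emitting functions it has already seen.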

A minimal producer (Rust)

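```rust
use std::fs::File;
use std::io::Write;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;
use std::time::Instant;

// Per-process counter for code_index.
static CODE_INDEX: AtomicU64 = AtomicU64::new(0);

// Monotonic nanoseconds since the first call; stax uses timestamps for ordering only.
fn timestamp() -> u64 {
    static START: OnceLock<Instant> = OnceLock::new();
    START.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

// Once, at file creation: write the 40-byte global header above.
// Per compiled function, append one JIT_CODE_LOAD record:
fn register(file: &mut File, name: &str, addr: u64, code: &[u8]) {
    let total = 16 + 40 + name.len() + 1 + code.len();
    let index = CODE_INDEX.fetch_add(1, Ordering::Relaxed);
    let mut r = Vec::with_capacity(total);
    r.extend_from_slice(&0u32.to_ne_bytes());                // id = JIT_CODE_LOAD
    r.extend_from_slice(&(total as u32).to_ne_bytes());      // total_size
    r.extend_from_slice(&timestamp().to_ne_bytes());         // timestamp
    r.extend_from_slice(&std::process::id().to_ne_bytes());  // pid
    r.extend_from_slice(&0u32.to_ne_bytes());                // tid (0 is fine)
    r.extend_from_slice(&addr.to_ne_bytes());                // vma
    r.extend_from_slice(&addr.to_ne_bytes());                // code_addr
    r.extend_from_slice(&(code.len() as u64).to_ne_bytes()); // code_size
    r.extend_from_slice(&index.to_ne_bytes());               // code_index
    r.extend_from_slice(name.as_bytes());
    r.push(0);                                               // NUL terminator
    r.extend_from_slice(code);                               // native_code
    // Best effort: a failed write only leaves this function unnamed.
    let _ = file.write_all(&r).and_then(|_| file.flush());
}
```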

Then profile it like anything else:

```bash
stax record -- ./your-jit-app
stax wait --for-samples 40000
stax top -n 10 --sort self      # JIT functions now show by name
stax annotate 'my_jit::func'    # per-instruction sample counts
```

Gotchas learned the hard way

  • Dump the finalized bytes, not the assembler buffer. With Cranelift, CompiledCode::code_buffer() gives the right length, but copy the bytes from the address JITModule::get_finalized_function returns — those have relocations applied, so stax annotate shows real bl / call targets instead of zeroed call slots.
  • The path is keyed by the target's PID. Use std::process::id(), not a parent's PID. If you launch under stax record -- env FOO=bar ./app, env execs app, so the PID is stable.
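  • Flush after every record. The tailer polls; an unflushed record is just a function that never gets a name. A raw std::fs::File is unbuffered, so this matters once you write through a BufWriter or a C FILE*. Partial trailing records are fine: the tailer only consumes whole records and re-reads from the boundary.
  • Truncate on create. A stale dump from a previous run (an earlier process that happened to get the same PID) describes that run's addresses and will mis-symbolicate yours.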
  • One record per function is enough. No CODE_CLOSE, no unwind tables.
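The PID-keyed path and truncate-on-create gotchas combine into a single setup step. A minimal sketch: `dump_path` and `create_dump` are hypothetical helpers, the header follows the global-header table in host byte order, and `elf_mach` is hard-coded to EM_AARCH64 because stax ignores it.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// The path stax watches, keyed by this process's own PID.
pub fn dump_path() -> PathBuf {
    PathBuf::from(format!("/tmp/jit-{}.dump", std::process::id()))
}

/// Creates (or truncates) the dump file and writes the 40-byte global header.
pub fn create_dump(path: &Path, timestamp: u64) -> io::Result<File> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true) // discard records left by an earlier process with this PID
        .open(path)?;
    let mut h = Vec::with_capacity(40);
    h.extend_from_slice(&0x4A69_5444u32.to_ne_bytes());     // magic "JiTD"
    h.extend_from_slice(&1u32.to_ne_bytes());               // version
    h.extend_from_slice(&40u32.to_ne_bytes());              // total_size
    h.extend_from_slice(&183u32.to_ne_bytes());             // elf_mach = EM_AARCH64 (ignored)
    h.extend_from_slice(&0u32.to_ne_bytes());               // pad1
    h.extend_from_slice(&std::process::id().to_ne_bytes()); // pid
    h.extend_from_slice(&timestamp.to_ne_bytes());          // timestamp
    h.extend_from_slice(&0u64.to_ne_bytes());               // flags
    file.write_all(&h)?;
    file.flush()?;
    Ok(file)
}
```

Call it once at JIT startup with `dump_path()`, keep the returned `File`, and pass it to `register` for each compiled function.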

See also

  • Stack Unwinding — JIT'd code still needs frame pointers to appear in a backtrace at all.
  • Inspecting a Run — what top, flame, and annotate show once your functions are named.