cache binaries that have nothing to instrument by gecube · Pull Request #325 · coroot/coroot-node-agent

gecube · 2026-06-30T13:43:35Z

Background

This follows up on #316 / #324. #324 stops the agent from logging an error for stripped Go binaries (failed to get symbol crypto/tls.(*Conn).Write: no symbols found). This PR addresses the underlying inefficiency that produced that log line on every process in the first place.

Problem

AcquireGlobalUprobe caches only successful attachments:

links := attach()
if len(links) == 0 {
    return UprobeKey{}, false   // nothing is remembered
}
t.globalUprobes[key] = &globalUprobe{links: links, refcount: 1}

When attach() finds nothing to instrument — most commonly a stripped Go binary such as kubectl (built with -ldflags "-s -w", so no symbol table) — the negative result is discarded. The per-process goTlsUprobesChecked guard only suppresses re-checks within a single process, so every new PID of the same binary re-opens the ELF and re-parses its symbol tables only to fail again. On nodes where such binaries run frequently (CI runners, operators invoking kubectl/helm in loops, cron) this is steady, pointless CPU and I/O — and, before #324, steady log noise.

(The work that precedes AcquireGlobalUprobe — reading build info, readlink, the version check — still runs per PID; it is needed to classify the process as a Go app and to resolve the cache key. The symbol-table parse inside attach() is the dominant cost and is what this cache eliminates.)

Why not hash the binary

The obvious correctness worry with caching a negative result is inode reuse: a file is deleted and its inode number is later reused by a different, possibly instrumentable, binary. A content hash would detect that, but hashing a 50–100 MB binary on every process spawn means reading the whole file from disk — far more expensive than the symbol parse we are trying to avoid. It trades a cheap operation for an expensive one.

(For the existing positive cache this race is effectively impossible: the entry is refcounted and held by live processes, so the inode cannot be freed and reused while the entry exists. It only matters once we start caching negatives, which nothing holds.)

Approach

Negatives live in their own map, separate from the refcounted positive cache, keyed by dev+inode. The size and mtime from the stat the function already performs are stored in the entry and re-checked on every lookup: a binary replaced in place, or one whose inode was recycled, almost always differs in size or mtime, so it is re-evaluated rather than served a stale verdict. This costs no extra I/O.

Keying by dev+inode (rather than folding size+mtime into the key) keeps the map bounded by the number of distinct inodes seen: a path that is repeatedly overwritten in place reuses a single entry instead of accumulating one per spawn.

size+mtime is a heuristic, not tamper-proof. A replacement that preserves both within the filesystem's timestamp granularity (cp -p, tar -xp, a snapshot restore) is served the stale verdict until the entry's TTL elapses; the TTL bounds that window. Deliberate tampering that preserves size and mtime is out of scope.

Expired entries are swept when a new negative entry is added, but at most once per TTL, so a burst of distinct stripped binaries does not turn every insertion into a full-map scan under the lock. Stale entries are additionally dropped lazily on lookup. Live (refcounted) attachments are never touched by the sweep.

Genuine attachments are unchanged.

Tests

ebpftracer/tracer_cache_test.go covers: an empty result is parsed once and then cached; a replaced binary (changed size/mtime) is re-evaluated; repeated in-place overwrite keeps a single cache entry; an expired negative entry is re-evaluated and replaced rather than duplicated; expired entries are swept while live entries survive; and concurrent acquisition of the same binary is race-clean. Run with go test -race ./ebpftracer/.

AcquireGlobalUprobe only cached successful attachments. When attach() returned no links (e.g. a stripped Go binary such as kubectl, which has no symbol table), nothing was remembered, so the binary was re-opened and its symbols re-parsed for every new process that ran it. On nodes that run such binaries frequently this is pure overhead. Cache the negative result in a separate map keyed by dev+inode. The binary's size and mtime are stored in the entry and re-checked on every lookup, so a binary replaced in place or whose inode is recycled is re-evaluated rather than served a stale verdict. Entries expire after a TTL; expired ones are swept on insertion, but at most once per TTL, so a burst of new binaries does not turn every insertion into a full-map scan under the lock. Live (refcounted) attachments are unchanged.

gecube force-pushed the negative_uprobe_cache branch from d1bc865 to 3bbab9f Compare June 30, 2026 14:54

def added the AI slop label Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cache binaries that have nothing to instrument#325

cache binaries that have nothing to instrument#325
gecube wants to merge 1 commit into
coroot:mainfrom
gecube:negative_uprobe_cache

gecube commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

gecube commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Problem

Why not hash the binary

Approach

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gecube commented Jun 30, 2026 •

edited

Loading