watchdog: add new action to capture backtraces#44620

Open
jmsadair wants to merge 35 commits into
envoyproxy:mainfrom
jmsadair:backtrace-watchdog-action

Conversation

@jmsadair

@jmsadair jmsadair commented Apr 23, 2026

Contributor

Commit Message: watchdog: add new action to capture backtraces
Additional Description: Adds envoy.watchdog.backtrace_action, a new watchdog action that captures stack traces of stuck threads. When triggered, the action sends SIGUSR2 to each offending thread. The signal handler captures the trace in place on that thread, and the trace is then logged on the dispatcher thread. A configurable per-thread cooldown (default: 10s) prevents trace spam during persistent stalls.
Risk Level: Low.
Testing: Added unit tests.
Docs Changes: Updated watchdog.rst.
Release Notes: Added.
Platform Specific Features: This feature is only supported on Linux because it uses the tgkill system call to send signals to individual threads.
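
For readers skimming the description, the per-thread cooldown can be sketched roughly as below. This is a minimal illustration and not the code in this PR: `TraceCooldown` and `shouldTrace` are hypothetical names, and the caller passes in the current time so the logic is easy to exercise in a test.

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper (not the PR's class): remembers when each thread was
// last traced and suppresses repeat traces inside the cooldown window.
class TraceCooldown {
public:
  using Clock = std::chrono::steady_clock;

  explicit TraceCooldown(std::chrono::milliseconds cooldown) : cooldown_(cooldown) {}

  // Returns true if a trace should be captured for `tid` at time `now`.
  // When it returns true, `now` is recorded as the last capture time.
  bool shouldTrace(int64_t tid, Clock::time_point now) {
    const auto it = last_trace_.find(tid);
    if (it != last_trace_.end() && now - it->second < cooldown_) {
      return false; // Still cooling down; skip to avoid trace spam.
    }
    last_trace_[tid] = now;
    return true;
  }

private:
  const std::chrono::milliseconds cooldown_;
  std::unordered_map<int64_t, Clock::time_point> last_trace_;
};
```

With a 10s cooldown, a second request for the same thread 5s later is suppressed, while a different thread, or the same thread once 10s have elapsed, is traced again.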

James Adair added 10 commits April 19, 2026 15:10
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #44620 was opened by jmsadair.

see: more, trace.

@jmsadair jmsadair marked this pull request as ready for review April 23, 2026 22:14
@repokitteh-read-only

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #44620 was ready_for_review by jmsadair.

see: more, trace.

James Adair added 5 commits April 23, 2026 22:28
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@repokitteh-read-only

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/coverage.yaml).
envoyproxy/coverage-shephards assignee is @RyanTheOptimist

🐱

Caused by: #44620 was synchronize by jmsadair.

see: more, trace.

James Adair added 5 commits April 24, 2026 03:29
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair

Contributor Author

/retest

Comment on lines +53 to +56
// Async-signal-safe: reads a thread-local cached on each watched thread when
// it registered with the watchdog (see worker_impl.cc / server.cc), so this
// is just a TLS load by the time we reach the signal handler.
const int64_t mytid = Thread::getCurrentThreadId();

Contributor Author

One thing to note here: I'm still not 100% sure this is actually async-signal-safe, even if the thread-local TID is guaranteed to have been initialized already. From what I have gathered, accessing TLS may acquire a lock in some cases, although that seems rather unlikely here.
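
To make the pattern under discussion concrete, here is a rough, Linux-only sketch of caching the TID in a thread-local at registration time. The names `t_cached_tid`, `registerCurrentThread` and `cachedThreadId` are hypothetical and are not Envoy's API. The sketch only shows the shape of the idea; it does not settle whether the later thread-local read is strictly async-signal-safe, which may depend on how the toolchain implements TLS.

```cpp
#include <cstdint>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical per-thread cache; -1 means "this thread never registered".
static thread_local int64_t t_cached_tid = -1;

// Called once by each watched thread, outside of any signal handler. Writing
// the variable here means its storage has already been touched by the time a
// handler reads it.
void registerCurrentThread() {
  t_cached_tid = static_cast<int64_t>(syscall(SYS_gettid));
}

// What a signal handler would call: a plain thread-local read, no syscall.
int64_t cachedThreadId() { return t_cached_tid; }
```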

It might be possible to come up with some other scheme for claiming slots, but I don't have one in mind at the moment. Alternatively, we could use pipes to communicate the backtrace to a thread that isn't handling the signal, since write is guaranteed to be async-signal-safe. Pipes have some caveats too, though.
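
The pipe alternative could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposal for the final design: `TraceRecord`, `installPipeHandler` and `readRecord` are hypothetical names, and the record carries a placeholder marker instead of real stack frames. The handler's only system call is `write`, which POSIX lists as async-signal-safe.

```cpp
#include <cerrno>
#include <cstdint>
#include <signal.h>
#include <unistd.h>

// Hypothetical fixed-size record. A real handler would fill in frame
// addresses; a constant marker keeps the sketch small.
struct TraceRecord {
  uint64_t marker;
};

// Write end of the pipe. Set once, before the handler is installed.
static int g_pipe_write_fd = -1;

// Signal handler: builds a record on the stack and hands it off with write().
static void onBacktraceSignal(int) {
  const int saved_errno = errno; // write() may clobber errno.
  const TraceRecord rec{0xB7ACEu};
  const ssize_t n = write(g_pipe_write_fd, &rec, sizeof(rec));
  (void)n; // Nothing safe to do about a failed write inside a handler.
  errno = saved_errno;
}

// Creates the pipe and installs the SIGUSR2 handler.
// Returns the read end of the pipe, or -1 on failure.
int installPipeHandler() {
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  g_pipe_write_fd = fds[1];

  struct sigaction sa {};
  sa.sa_handler = onBacktraceSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  if (sigaction(SIGUSR2, &sa, nullptr) != 0) {
    close(fds[0]);
    close(fds[1]);
    g_pipe_write_fd = -1;
    return -1;
  }
  return fds[0];
}

// Runs on the thread that is NOT handling the signal: reads one full record.
bool readRecord(int read_fd, TraceRecord& out) {
  return read(read_fd, &out, sizeof(out)) == static_cast<ssize_t>(sizeof(out));
}
```

One of the caveats: the pipe has finite capacity, so a blocking write end could stall the handler if the reader falls behind. A non-blocking write end that drops records when full is one way to trade completeness for safety.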

@KBaichoo KBaichoo self-assigned this Apr 27, 2026
Signed-off-by: James Adair <jadair@netflix.com>
Comment thread test/coverage.yaml Outdated
James Adair added 2 commits May 2, 2026 00:05
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair

jmsadair commented May 2, 2026

Contributor Author

/coverage

@repokitteh-read-only

Coverage for this Pull Request will be rendered here:

https://storage.googleapis.com/envoy-cncf-pr/44620/coverage/index.html

For comparison, current coverage on main branch is here:

https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html

The coverage results are (re-)rendered each time the CI Envoy/Checks (coverage) job completes.

🐱

Caused by: a #44620 (comment) was created by @jmsadair.

see: more, trace.

Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair jmsadair requested a review from RyanTheOptimist May 2, 2026 02:04
James Adair added 3 commits May 3, 2026 21:45
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair

jmsadair commented May 8, 2026

Contributor Author

/retest

Comment thread test/coverage.yaml

@KBaichoo KBaichoo left a comment

Contributor

Thanks for your patience. I gave this a pass.

/wait

Comment thread changelogs/current.yaml Outdated
Comment thread source/extensions/watchdog/backtrace_action/backtrace_action.cc Outdated
Comment thread source/server/backtrace.h Outdated
Comment thread source/common/signal/non_fatal_signal_action.cc
Comment thread source/common/signal/non_fatal_signal_handler.cc Outdated
Comment thread source/common/signal/non_fatal_signal_handler.cc
James Adair added 4 commits June 7, 2026 17:18
…tion

# Conflicts:
#	changelogs/current.yaml

Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
James Adair added 2 commits June 8, 2026 17:13
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
Comment on lines +34 to +43
bool signalThread(const ThreadId& tid, int signal) {
#ifdef __linux__
  return syscall(SYS_tgkill, getpid(), toPlatformTid(tid.getId()), signal) == 0;
#else
  // Only Linux supports the tgkill system call.
  ENVOY_LOG_MISC(error, "signalThread is only supported on Linux.");
  return false;
#endif
}

Contributor Author

@KBaichoo FYI I had to update this to use tgkill instead of kill. The kill system call cannot be used to signal particular TIDs (I'm actually surprised this would work with the abort action). Unfortunately, tgkill is only supported on Linux, though.

There is pthread_kill, which would work on both macOS and Linux, but we need a pthread_t to call it, and I don't think it's trivial to get one where we are calling this.

See here for reference:
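
As an illustration of the pthread_kill route mentioned above (not code from this PR), the sketch below has a worker record its own `pthread_t` when it "registers", so that another thread can signal it later. `signalThreadPortable` and `demoSignalWorker` are hypothetical names. How Envoy would actually plumb a `pthread_t` through to the call site is the open question and is not addressed here.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>
#include <thread>

// Set by the handler so the example can observe that the signal arrived.
static volatile sig_atomic_t g_signal_seen = 0;

static void markSignalSeen(int) { g_signal_seen = 1; }

// Hypothetical portable variant of signalThread(): takes the pthread_t that
// the watched thread recorded for itself instead of a kernel TID.
bool signalThreadPortable(pthread_t thread, int signal) {
  return pthread_kill(thread, signal) == 0;
}

// Installs a SIGUSR2 handler, starts a worker that publishes its pthread_t,
// signals that worker, and reports whether the worker saw the signal.
bool demoSignalWorker() {
  struct sigaction sa {};
  sa.sa_handler = markSignalSeen;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR2, &sa, nullptr) != 0) {
    return false;
  }

  std::atomic<bool> registered{false};
  std::atomic<bool> stop{false};
  pthread_t worker_handle{};

  std::thread worker([&] {
    worker_handle = pthread_self(); // The "registration" step.
    registered.store(true, std::memory_order_release);
    // Spin until the handler has run on this thread (or we are told to stop).
    while (g_signal_seen == 0 && !stop.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
  });

  while (!registered.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  const bool sent = signalThreadPortable(worker_handle, SIGUSR2);
  if (!sent) {
    stop.store(true, std::memory_order_release); // Avoid hanging on join().
  }
  worker.join();
  return sent && g_signal_seen == 1;
}
```

Because pthread_kill directs the signal at one specific thread, the handler runs on the worker itself, which is the same per-thread targeting that tgkill provides on Linux.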

@RyanTheOptimist

Contributor

Coverage LGTM

James Adair added 2 commits June 14, 2026 16:35
Signed-off-by: James Adair <jadair@netflix.com>
Signed-off-by: James Adair <jadair@netflix.com>
@jmsadair

Contributor Author

/retest

@jmsadair jmsadair requested a review from KBaichoo June 15, 2026 00:41
5 participants