watchdog: add new action to capture backtraces #44620
Signed-off-by: James Adair <jadair@netflix.com>
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
CC @envoyproxy/coverage-shephards: FYI only for changes made to
/retest
// Async-signal-safe: reads a thread-local cached on each watched thread when
// it registered with the watchdog (see worker_impl.cc / server.cc), so this
// is just a TLS load by the time we reach the signal handler.
const int64_t mytid = Thread::getCurrentThreadId();
Something to note here: I'm still not 100% sure this is actually async-signal-safe, even if the thread-local TID is guaranteed to have been initialized already. From what I have gathered, accessing TLS can acquire a lock in some cases, although that seems rather unlikely here.
It might be possible to come up with some other scheme for claiming slots, but I don't have one in mind at the moment. Alternatively, we can use a pipe to communicate the backtrace to a thread that isn't handling the signal, since write is guaranteed to be async-signal-safe. Pipes have some caveats too, though.
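To make the pipe alternative concrete, here is a minimal sketch of what it could look like. This is not code from the PR: `TraceRecord`, `onSignal`, `installPipeHandler`, and `readRecord` are hypothetical names, and the actual frame capture is replaced by a placeholder value. The handler takes no locks and reads no application `thread_local` state; it only fills a stack-allocated record and calls `write`.

```cpp
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

#include <cerrno>
#include <cstdint>
#include <cstring>

// Hypothetical fixed-size record. POSIX guarantees that a pipe write of at
// most PIPE_BUF bytes is atomic, so records from different threads cannot
// interleave as long as the record stays this small.
struct TraceRecord {
  int signo;
  uint32_t depth;
  uintptr_t frames[16];
};
static_assert(sizeof(TraceRecord) <= 512, "record must fit in _POSIX_PIPE_BUF");

int g_pipe_fds[2] = {-1, -1};

// Signal handler: builds the record on its own stack and hands it off with
// write(), which is on the POSIX async-signal-safe list.
void onSignal(int signo) {
  const int saved_errno = errno;
  TraceRecord rec{};
  rec.signo = signo;
  rec.depth = 1;
  // Placeholder standing in for real frame capture, which is elided here.
  rec.frames[0] = reinterpret_cast<uintptr_t>(&onSignal);
  // The write end is non-blocking: if the pipe is full, the record is dropped.
  const ssize_t written = write(g_pipe_fds[1], &rec, sizeof(rec));
  (void)written;
  errno = saved_errno;
}

// Creates the pipe, makes the write end non-blocking, and installs the handler.
bool installPipeHandler(int signo) {
  if (pipe(g_pipe_fds) != 0) {
    return false;
  }
  const int flags = fcntl(g_pipe_fds[1], F_GETFL);
  if (flags == -1 || fcntl(g_pipe_fds[1], F_SETFL, flags | O_NONBLOCK) == -1) {
    return false;
  }
  struct sigaction sa;
  std::memset(&sa, 0, sizeof(sa));
  sa.sa_handler = onSignal;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_RESTART;
  return sigaction(signo, &sa, nullptr) == 0;
}

// Reader side, meant to run on a thread that is not handling the signal.
// Blocks until one full record is available.
bool readRecord(TraceRecord& out) {
  ssize_t n;
  do {
    n = read(g_pipe_fds[0], &out, sizeof(out));
  } while (n == -1 && errno == EINTR);
  return n == static_cast<ssize_t>(sizeof(out));
}
```

One of the caveats shows up directly in the sketch: with a non-blocking write end, a full pipe silently drops the record, while a blocking write end could stall the signaled thread inside the handler.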
/coverage
Coverage for this Pull Request will be rendered here: https://storage.googleapis.com/envoy-cncf-pr/44620/coverage/index.html
For comparison, current coverage on https://storage.googleapis.com/envoy-cncf-postsubmit/main/coverage/index.html
The coverage results are (re-)rendered each time the CI
/retest
KBaichoo left a comment
Thanks for your patience, gave this a pass
/wait
…tion # Conflicts: # changelogs/current.yaml Signed-off-by: James Adair <jadair@netflix.com>
bool signalThread(const ThreadId& tid, int signal) {
#ifdef __linux__
  return syscall(SYS_tgkill, getpid(), toPlatformTid(tid.getId()), signal) == 0;
#else
  // Only Linux supports the tgkill system call.
  ENVOY_LOG_MISC(error, "signalThread is only supported on Linux.");
  return false;
#endif
}
@KBaichoo FYI I had to update this to use tgkill instead of kill. The kill system call cannot be used to signal particular TIDs (I'm actually surprised this would work with the abort action). Unfortunately, tgkill is only supported on Linux.
There is pthread_kill, which would work on both macOS and Linux, but it needs a pthread_t and I don't think it's trivial to get one where we are calling this.
See here for reference:
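Separately from the reference above, the difference between the two calls can be illustrated with a small Linux-only sketch. It is not taken from the PR: all names (`signalByTid`, `signalByHandle`, `runDemo`, and the globals) are hypothetical, and it assumes `std::thread::native_handle()` returns a `pthread_t`, which holds for the common POSIX standard library implementations but is implementation-defined.

```cpp
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <atomic>
#include <cerrno>
#include <thread>

// Hypothetical globals written by the handler. Lock-free atomics are safe to
// touch from a signal handler.
std::atomic<int> g_hits{0};
std::atomic<pid_t> g_last_handler_tid{0};

void onUsr2(int) {
  const int saved_errno = errno;
  // Record which kernel thread actually ran the handler.
  g_last_handler_tid.store(static_cast<pid_t>(syscall(SYS_gettid)));
  g_hits.fetch_add(1);
  errno = saved_errno;
}

// Thread-directed signal by kernel TID, as in the PR's tgkill call (Linux only).
bool signalByTid(pid_t tid, int signo) {
  return syscall(SYS_tgkill, getpid(), tid, signo) == 0;
}

// Portable alternative, but only usable when the caller holds the pthread_t.
bool signalByHandle(pthread_t handle, int signo) {
  return pthread_kill(handle, signo) == 0;
}

struct DemoResult {
  bool tgkill_ok = false;
  bool pthread_kill_ok = false;
  bool ran_on_worker = false;
};

DemoResult runDemo() {
  DemoResult result;
  struct sigaction sa = {};
  sa.sa_handler = onUsr2;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR2, &sa, nullptr) != 0) {
    return result;
  }

  std::atomic<pid_t> worker_tid{0};
  std::atomic<bool> stop{false};
  std::thread worker([&] {
    // The worker publishes its own kernel TID, then spins until told to stop.
    worker_tid.store(static_cast<pid_t>(syscall(SYS_gettid)));
    while (!stop.load()) {
      std::this_thread::yield();
    }
  });
  while (worker_tid.load() == 0) {
    std::this_thread::yield();
  }

  const int base = g_hits.load();
  result.tgkill_ok = signalByTid(worker_tid.load(), SIGUSR2);
  while (result.tgkill_ok && g_hits.load() < base + 1) {
    std::this_thread::yield();
  }
  const bool first_on_worker = g_last_handler_tid.load() == worker_tid.load();

  result.pthread_kill_ok = signalByHandle(worker.native_handle(), SIGUSR2);
  while (result.pthread_kill_ok && g_hits.load() < base + 2) {
    std::this_thread::yield();
  }
  result.ran_on_worker = first_on_worker && g_last_handler_tid.load() == worker_tid.load();

  stop.store(true);
  worker.join();
  return result;
}
```

Both calls deliver the signal to the chosen thread; the practical difference is what the caller must already hold (a kernel TID versus a `pthread_t`), which is the constraint described in the comment above.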
Coverage LGTM
/retest
Commit Message: watchdog: add new action to capture backtraces
Additional Description: Adds envoy.watchdog.backtrace_action, a new watchdog action that captures stack traces of stuck threads. When triggered, the action signals each offending thread with SIGUSR2; the thread captures its trace in place in the signal handler, and the trace is then logged on the dispatcher thread. A configurable per-thread cooldown (default: 10s) prevents trace spam on persistent stalls.
Risk Level: Low.
Testing: Added unit tests.
Docs Changes: Updated watchdog.rst.
Release Notes: Added.
Platform Specific Features: This feature is only supported on Linux due to use of the tgkill system call to send signals to threads.
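For readers unfamiliar with the per-thread cooldown mentioned in the description, the idea can be sketched as follows. This is an illustration only and not the PR's implementation: `BacktraceCooldown` and `shouldCapture` are hypothetical names, and the sketch assumes the check runs on a single thread, so it does no locking.

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// Hypothetical helper: remembers when each thread was last traced and
// suppresses further captures for that thread until the cooldown elapses.
class BacktraceCooldown {
public:
  using Clock = std::chrono::steady_clock;

  explicit BacktraceCooldown(std::chrono::milliseconds cooldown) : cooldown_(cooldown) {}

  // Returns true and records `now` if `tid` may be signaled; returns false if
  // the thread was traced less than one cooldown period ago.
  bool shouldCapture(int64_t tid, Clock::time_point now) {
    const auto it = last_capture_.find(tid);
    if (it != last_capture_.end() && now - it->second < cooldown_) {
      return false;
    }
    last_capture_[tid] = now;
    return true;
  }

private:
  const std::chrono::milliseconds cooldown_;
  std::unordered_map<int64_t, Clock::time_point> last_capture_;
};
```

Passing the current time in as a parameter, rather than reading the clock inside the helper, keeps the sketch easy to unit test with synthetic timestamps. Because the state is keyed by thread ID, a persistent stall on one thread does not suppress traces for another.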