OCPBUGS-81648: Fix orphaned shell processes in pod terminal on WebSocket disconnect #16299
Leo6Leo wants to merge 1 commit into openshift:release-4.18 from …
Conversation
@Leo6Leo: This pull request references Jira Issue OCPBUGS-81648, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml | Review profile: CHILL | Plan: Pro Plus | Run ID: …
You can disable this status message by setting the … Use the checkbox below for a quick retry.
Walkthrough

Changes
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 8 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (8 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Leo6Leo. The full list of commands accepted by this bot can be found here.
Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/proxy/proxy.go`:
- Around line 234-241: The defer cleanup in the connection handler writes to the
shared websocket backend (calling sendExecExitCommand(backend) and
backend.WriteControl/Close) without synchronizing with other writers (e.g., the
copyMsgs goroutine that uses writeMutex), risking concurrent writes; introduce
and use a dedicated backendWriteMutex (or reuse writeMutex consistently) to
protect all backend write calls: acquire backendWriteMutex before calling
sendExecExitCommand(backend), before backend.WriteControl(...), and before
backend.Close() in the defer, and ensure copyMsgs and any other code paths also
lock the same backendWriteMutex when performing
backend.WriteMessage/WriteControl to guarantee the gorilla/websocket one-writer
rule is respected.
- Around line 299-307: sendExecExitCommand currently always base64-encodes the
"exit\r" frame which only works for the base64.channel.k8s.io subprotocol;
change it to inspect backend.Subprotocol() inside sendExecExitCommand and
branch: if the negotiated subprotocol is base64.channel.k8s.io (or starts with
"base64.channel") keep the existing base64 text framing and send as
websocket.TextMessage, otherwise for binary channel subprotocols like
v4.channel.k8s.io or v5.channel.k8s.io send a websocket.BinaryMessage with a
leading channel byte 0 followed by the raw bytes of "exit\r"; add a debug log if
the subprotocol is unrecognized before sending.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ecf7fe8e-4be6-42e6-85e7-14602cc72b4d
📒 Files selected for processing (1)
pkg/proxy/proxy.go
defer func() {
	if isExec {
		sendExecExitCommand(backend)
	}
	closeMsg := websocket.FormatCloseMessage(websocket.CloseNormalClosure, "")
	_ = backend.WriteControl(websocket.CloseMessage, closeMsg, time.Now().Add(websocketTimeout))
	backend.Close()
}()
🧩 Analysis chain
🌐 Web query:
In github.com/gorilla/websocket, are concurrent calls to WriteMessage from multiple goroutines allowed on the same Conn, and is WriteControl the only write API that is safe concurrently with other methods?
💡 Result:
No, concurrent calls to WriteMessage from multiple goroutines on the same Conn are not allowed. Yes, WriteControl is safe concurrently with other methods (along with Close). The official documentation states: "Applications are responsible for ensuring that no more than one goroutine calls the write methods (NextWriter, SetWriteDeadline, WriteMessage, WriteJSON, EnableWriteCompression, SetCompressionLevel) concurrently... The Close and WriteControl methods can be called concurrently with all other methods." This is confirmed in the source code where WriteMessage acquires a write mutex (c.mu), but concurrent access triggers a panic "concurrent write to websocket connection". WriteControl does not acquire this mutex and can interleave with data messages per the WebSocket protocol.
Citations:
- 1: https://github.com/gorilla/websocket/blob/main/doc.go
- 2: https://github.com/gorilla/websocket/blob/main/conn.go
- 3: [FEATURE] Reduce support issues related to concurrent writes gorilla/websocket#957
- 4: [question] Emit ping via WriteMessage() or WriteControl()? gorilla/websocket#652
- 5: https://stackoverflow.com/questions/43225340/how-to-ensure-concurrency-in-golang-gorilla-websocket-package
🏁 Script executed:
# First, examine the sendExecExitCommand function around line 302-305
sed -n '300,310p' pkg/proxy/proxy.go
🏁 Script executed:
# Next, find and examine the copyMsgs function
rg "func copyMsgs" -A 10 pkg/proxy/proxy.go
🏁 Script executed:
# Also check the context around line 233-241 to understand the goroutine setup
sed -n '225,250p' pkg/proxy/proxy.go
🏁 Script executed:
# Check the full goroutine setup and error handling
sed -n '240,270p' pkg/proxy/proxy.go
🏁 Script executed:
# Also check if there's a writeMutex declaration and how it's passed to the goroutine
sed -n '220,245p' pkg/proxy/proxy.go
🏁 Script executed:
# Check the goroutine launches - where copyMsgs is called with which mutex
sed -n '265,290p' pkg/proxy/proxy.go
🏁 Script executed:
# Check if there's a separate mutex for backend or how the goroutines sync
sed -n '285,320p' pkg/proxy/proxy.go
Prevent concurrent backend data writes during deferred exec cleanup.
The sendExecExitCommand(backend) call in the defer block writes to the backend connection without synchronization, while the second copyMsgs goroutine (line 280) holds writeMutex when writing to the same backend connection. When one goroutine errors and returns, the defer executes while the other goroutine may still be writing to backend, causing concurrent WriteMessage calls that violate gorilla/websocket's one-writer contract and can trigger panics or drop the exit frame.
The suggested fix using a dedicated backendWriteMutex is appropriate.
Suggested fix
+ var backendWriteMutex sync.Mutex
defer func() {
if isExec {
+ backendWriteMutex.Lock()
sendExecExitCommand(backend)
+ backendWriteMutex.Unlock()
}
closeMsg := websocket.FormatCloseMessage(websocket.CloseNormalClosure, "")
_ = backend.WriteControl(websocket.CloseMessage, closeMsg, time.Now().Add(websocketTimeout))
backend.Close()
}()
...
- go func() { errc <- copyMsgs(&writeMutex, backend, frontend) }()
+ go func() { errc <- copyMsgs(&backendWriteMutex, backend, frontend) }()
// sendExecExitCommand sends "exit\r" to the exec session's STDIN channel
// using the base64.channel.k8s.io subprotocol to terminate the shell process
// and prevent orphaned processes when the frontend WebSocket disconnects.
func sendExecExitCommand(backend *websocket.Conn) {
	exitCmd := base64.StdEncoding.EncodeToString([]byte("exit\r"))
	if err := backend.WriteMessage(websocket.TextMessage, []byte("0"+exitCmd)); err != nil {
		klog.V(4).Infof("Failed to send exit command to exec session: %v", err)
	}
}
🧩 Analysis chain
🌐 Web query:
For Kubernetes exec over WebSocket, what are the stdin frame formats for base64.channel.k8s.io vs v4/v5.channel.k8s.io, and does writing "0"+base64(text) only apply to base64.channel.k8s.io?
💡 Result:
For Kubernetes exec over WebSocket, the stdin frame formats differ between base64.channel.k8s.io and the v4/v5.channel.k8s.io protocols as follows:
- base64.channel.k8s.io: Each message is prefixed with a single ASCII character channel ID ('0' for stdin) followed by the base64-encoded data. The server decodes the base64 data for stdin; the client must base64-encode stdin input before sending. Example: to send "foo\n" on stdin, send '0' + base64("foo\n").
- v4.channel.k8s.io and v5.channel.k8s.io: These are binary protocols. Each message is prefixed with a single byte channel ID (0x00 for stdin) followed by the raw binary data (no base64 encoding/decoding). The client sends raw bytes prefixed with 0x00 for stdin.
Writing "0"+base64(text) (where "0" is the ASCII character) applies only to base64.channel.k8s.io. For v4/v5.channel.k8s.io, use binary byte 0x00 + raw text bytes. Note: v4 and v5 primarily differ in stdin close signaling (v5 supports proper CLOSE via channel 255), but stdin data framing is identical (binary). Base64 variants like v4.base64.channel.k8s.io may exist but are less common; they would follow base64.channel.k8s.io framing with the '0' prefix.
Citations:
- 1: /exec endpoint with base64.channel.k8s.io does not decode data origin#4985
- 2: Support exec and pod logging over WebSockets kubernetes/kubernetes#13885
- 3: https://kubernetes-client.github.io/csharp/api/k8s.WebSocketProtocol.html
- 4: https://kubernetes.io/blog/2024/05/01/cri-streaming-explained/
- 5: WebSocket Client and V5 RemoteCommand Subprotocol kubernetes/kubernetes#119157
- 6: https://pkg.go.dev/k8s.io/Kubernetes/pkg/util/wsstream
- 7: Expose websocket header to support base64.channel.k8s.io or v4.base64.channel.k8s.io channel protocol kubernetes-client/python#2098
- 8: https://pkg.go.dev/github.com/containerd/containerd/pkg/cri/streaming/internal/wsstream
- 9: https://v1-33.docs.kubernetes.io/blog/2024/08/20/websockets-transition
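The framing rules above can be condensed into a small helper. This is an illustrative sketch, not code from the PR: buildStdinFrame and the message-type constants are hypothetical stand-ins for gorilla/websocket's TextMessage and BinaryMessage.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// Constants mirror gorilla/websocket's TextMessage (1) and BinaryMessage (2).
const (
	textMessage   = 1
	binaryMessage = 2
)

// buildStdinFrame (hypothetical helper) builds a stdin frame for the
// negotiated exec subprotocol.
func buildStdinFrame(subprotocol string, data []byte) (msgType int, payload []byte) {
	if strings.HasPrefix(subprotocol, "base64.channel") {
		// ASCII '0' channel prefix + base64-encoded data, sent as a text frame.
		return textMessage, append([]byte("0"), base64.StdEncoding.EncodeToString(data)...)
	}
	// v4/v5.channel.k8s.io: raw 0x00 channel byte + raw data, sent as binary.
	return binaryMessage, append([]byte{0}, data...)
}

func main() {
	t, p := buildStdinFrame("base64.channel.k8s.io", []byte("exit\r"))
	fmt.Println(t, string(p)) // 1 0ZXhpdA0=
	t, p = buildStdinFrame("v5.channel.k8s.io", []byte("exit\r"))
	fmt.Println(t, p[0], len(p)) // 2 0 6
}
```

In the real proxy the branch would be selected on backend.Subprotocol(), as the review suggests below.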
🏁 Script executed:
# Search for sendExecExitCommand usage and definition
rg -n "sendExecExitCommand" pkg/proxy/ -A 5 -B 5
🏁 Script executed:
# Search for WebSocket subprotocol negotiation in the codebase
rg -n "Subprotocol\|base64\.channel\.k8s\.io\|channel\.k8s\.io" pkg/proxy/ -B 3 -A 3
🏁 Script executed:
# Look for WebSocket connection setup and negotiation
rg -n "websocket\.(Dialer|Upgrader|Conn)" pkg/proxy/ -B 2 -A 5
🏁 Script executed:
# Check for any exec-related constants or protocol definitions
rg -n "exec.*protocol\|protocol.*exec" pkg/ -i
🏁 Script executed:
# Find the function signature and where subProtocol is passed
rg -n "func.*websocket\|ProxyHandler.*subProtocol" pkg/proxy/proxy.go -B 5 -A 10
🏁 Script executed:
# Check where sendExecExitCommand is used and trace back the subProtocol
rg -n "subProtocol" pkg/proxy/ -B 2 -A 2
🏁 Script executed:
# Look for all references to "base64" in the proxy code
rg -n "base64" pkg/proxy/
🏁 Script executed:
# Check what websocket library is being used and if Conn has Subprotocol method
rg -n "import.*websocket" pkg/proxy/
🏁 Script executed:
# Look for any existing uses of .Subprotocol() in the codebase
rg -n "\.Subprotocol\(\)" pkg/
🏁 Script executed:
# Check proxy.go imports section
head -n 20 pkg/proxy/proxy.go
🏁 Script executed:
# Search for websocket import more broadly
rg "websocket" pkg/proxy/proxy.go | head -20
🏁 Script executed:
# Check gorilla/websocket documentation or code - does Conn have Subprotocol method?
# Let's look at the test file to see if Subprotocol is used anywhere
cat -n pkg/proxy/proxy_test.go | head -50
🏁 Script executed:
# Search for any existing pattern of reading negotiated subprotocol
rg -n "Subprotocol" pkg/proxy/proxy_test.go -B 3 -A 3
🏁 Script executed:
# Verify gorilla/websocket Conn.Subprotocol() method exists in documentation or go.mod
cat pkg/go.mod 2>/dev/null | grep gorilla || cat go.mod 2>/dev/null | grep gorilla || rg "gorilla/websocket" -g "*.mod"
🏁 Script executed:
# Check if there's any way to access subprotocol from the connection in gorilla/websocket
# Look at the upgrader usage to see what information is available after upgrade
rg -n "Upgrade\|SubprotocolError" pkg/proxy/proxy.go -B 3 -A 5
🏁 Script executed:
# Search gorilla/websocket v1.5.0 API - check if websocket.Conn has Subprotocol() method
python3 << 'EOF'
import subprocess
result = subprocess.run(["curl", "-s", "https://pkg.go.dev/github.com/gorilla/websocket@v1.5.0?tab=doc"],
                        capture_output=True, text=True, timeout=10)
if "Subprotocol" in result.stdout:
    print("Subprotocol method found in gorilla/websocket v1.5.0")
    # Extract relevant lines
    lines = result.stdout.split('\n')
    for i, line in enumerate(lines):
        if 'Subprotocol' in line:
            print('\n'.join(lines[max(0, i-2):min(len(lines), i+5)]))
else:
    print("Could not verify Subprotocol method - checking alternative approaches")
EOF
🏁 Script executed:
# Check the upgraded websocket connection - how is subprotocol stored/accessed?
# Look at the Upgrader.Upgrade return values
rg -n "Upgrade.*frontend\|conn.*Upgrade" pkg/proxy/proxy.go -A 3
🏁 Script executed:
# Check the Upgrader.Upgrade signature and return value
rg -n "frontend.*:=.*upgrader\.Upgrade\|websocket\.Upgrader.*Upgrade" pkg/proxy/proxy.go -A 10 -B 2
🏁 Script executed:
# Let's verify gorilla/websocket API by checking pkg.go.dev directly
# gorilla/websocket Conn should have SubprotocolError or access to negotiated subprotocol
python3 << 'EOF'
# Check gorilla/websocket v1.5.0 documentation
import subprocess
try:
    result = subprocess.run(
        ["curl", "-s", "https://raw.githubusercontent.com/gorilla/websocket/v1.5.0/conn.go"],
        capture_output=True, text=True, timeout=10
    )
    # Look for Subprotocol method
    if "func (c *Conn) Subprotocol" in result.stdout:
        print("✓ Subprotocol() method exists in gorilla/websocket v1.5.0")
        # Extract the method
        lines = result.stdout.split('\n')
        for i, line in enumerate(lines):
            if 'func (c *Conn) Subprotocol' in line:
                print('\n'.join(lines[i:min(len(lines), i+5)]))
                break
    else:
        print("Checking Upgrader return value instead...")
        if "SubprotocolError" in result.stdout or "subprotocol" in result.stdout.lower():
            print("Found subprotocol-related code")
except Exception as e:
    print(f"Error: {e}")
EOF
Check which subprotocol is negotiated before hardcoding base64 framing in cleanup.
The sendExecExitCommand function assumes base64.channel.k8s.io, but the actual subprotocol is negotiated dynamically from client headers and can vary. For binary protocols like v4.channel.k8s.io or v5.channel.k8s.io, the frame format "0"+base64(text) is invalid—the cleanup will fail silently and leave orphaned processes. Use backend.Subprotocol() to gate the encoding logic.
Suggested refactor
func sendExecExitCommand(backend *websocket.Conn) {
- exitCmd := base64.StdEncoding.EncodeToString([]byte("exit\r"))
- if err := backend.WriteMessage(websocket.TextMessage, []byte("0"+exitCmd)); err != nil {
- klog.V(4).Infof("Failed to send exit command to exec session: %v", err)
+ switch backend.Subprotocol() {
+ case "base64.channel.k8s.io":
+ exitCmd := base64.StdEncoding.EncodeToString([]byte("exit\r"))
+ if err := backend.WriteMessage(websocket.TextMessage, []byte("0"+exitCmd)); err != nil {
+ klog.V(4).Infof("Failed to send exit command to exec session: %v", err)
+ }
+ default:
+ klog.V(4).Infof("Skipping exec exit command for unsupported websocket subprotocol: %q", backend.Subprotocol())
}
}
…nect

When a terminal WebSocket connection drops, the shell process inside the container was never terminated, leading to accumulation of orphaned processes. This occurred because gorilla/websocket's Close() only drops the TCP connection without sending a WebSocket close frame, and no exit command was sent to the shell.

Send "exit" to the exec STDIN channel and a proper WebSocket close frame to the Kubernetes API server when the proxy connection closes. Use a dedicated backendWriteMutex to prevent concurrent writes to the backend connection, and check the negotiated subprotocol to use the correct frame encoding.
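The cleanup sequence the commit message describes can be sketched as below. backendConn, cleanupExec, and the recorder are illustrative stand-ins: the real code operates on a gorilla/websocket *Conn (whose WriteControl also takes a deadline) and branches on the negotiated subprotocol, both omitted here for brevity.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// backendConn is an assumed interface standing in for *websocket.Conn.
type backendConn interface {
	WriteMessage(msgType int, data []byte) error
	WriteControl(msgType int, data []byte) error
	Close() error
}

const (
	textMessage  = 1 // mirrors websocket.TextMessage
	closeMessage = 8 // mirrors websocket.CloseMessage
)

// cleanupExec shows the order of operations on disconnect: terminate the
// remote shell, send a proper close frame, then drop the TCP connection.
func cleanupExec(b backendConn) {
	// "exit\r" on the stdin channel ('0' prefix, base64.channel.k8s.io framing).
	exitCmd := base64.StdEncoding.EncodeToString([]byte("exit\r"))
	_ = b.WriteMessage(textMessage, []byte("0"+exitCmd))
	// Tell the API server the session is over before closing.
	_ = b.WriteControl(closeMessage, nil)
	_ = b.Close()
}

// recorder logs the call sequence for demonstration.
type recorder struct{ calls []string }

func (r *recorder) WriteMessage(t int, d []byte) error {
	r.calls = append(r.calls, fmt.Sprintf("msg:%d:%s", t, d))
	return nil
}
func (r *recorder) WriteControl(t int, d []byte) error {
	r.calls = append(r.calls, fmt.Sprintf("ctl:%d", t))
	return nil
}
func (r *recorder) Close() error { r.calls = append(r.calls, "close"); return nil }

func main() {
	r := &recorder{}
	cleanupExec(r)
	for _, c := range r.calls {
		fmt.Println(c)
	}
	// Prints:
	// msg:1:0ZXhpdA0=
	// ctl:8
	// close
}
```

Ordering matters: the exit frame must reach the API server before the close frame, and Close must come last so the shell receives the exit command.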
12ece3f to 03f907a (Compare)
@Leo6Leo: The following test failed, say …
Full PR test history. Your PR dashboard.
Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary
- … orphaned sh processes
- Close() only drops the TCP connection without sending a WebSocket close frame, and no exit command was sent to the shell on disconnect
- Sends exit through the exec STDIN channel (base64.channel.k8s.io protocol) and a proper WebSocket close frame to the Kubernetes API server when the proxy connection closes

Details
The cleanup in componentWillUnmount (frontend) only runs when the user navigates away from the terminal page. When the WebSocket drops while the page stays loaded (e.g., load balancer idle timeout, network interruption), the shell process inside the container survives indefinitely. Each reconnection spawns a new shell, causing orphaned processes to accumulate. This is especially impactful for high-usage pods like DB2U.

The fix adds cleanup at the proxy layer (pkg/proxy/proxy.go), which handles all disconnect scenarios regardless of how the frontend behaves.

Test plan
- … (ps aux via oc exec)
- … (ps aux)
- … (go test ./pkg/proxy/)
- … (oc rsh / oc exec) unaffected (fix only targets the /exec WebSocket path in the proxy)

Summary by CodeRabbit
ps auxviaoc exec)ps aux)go test ./pkg/proxy/)oc rsh/oc exec) unaffected (fix only targets/execWebSocket path in the proxy)Summary by CodeRabbit