[improve][broker] Add reset cursor latency metric #26088
Pull request overview
This PR improves broker observability and robustness around subscription cursor resets by adding a Prometheus latency metric for reset-cursor operations, hardening the binary protocol seek error response path against null causes, and enabling a managed-ledger read-entry timeout by default to reduce the risk of stuck reads blocking cursor operations.
Changes:
- Added `pulsar_subscription_reset_cursor_latency_ms` Summary metric (labels: `topic`, `subscription`, `result`) and recorded latency for both timestamp- and position-based reset cursor paths, including early fenced failures.
- Made seek/reset-cursor error responses safer by unwrapping completion exceptions and using a null-safe message fallback.
- Changed `managedLedgerReadEntryTimeoutSeconds` default to `120` seconds and updated template configs/comments to recommend keeping it enabled (with `0` as the disable value).
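
As a rough illustration of the first item, the sketch below times an asynchronous operation and reports the latency with a `result` label on both the success and failure paths, including failures thrown before any future is returned. It is a minimal stand-in using only the JDK: `ResetCursorTimer`, `LatencySink`, and the `"success"`/`"failure"` label values are hypothetical names chosen for this example, not the PR's actual code or the Prometheus client API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not taken from the PR.
final class ResetCursorTimer {

    /** Hypothetical sink standing in for a labeled Prometheus Summary. */
    interface LatencySink {
        void observe(String topic, String subscription, String result, double latencyMs);
    }

    /**
     * Runs the operation and records its latency once it completes,
     * whether it succeeds, fails asynchronously, or throws synchronously.
     */
    static <T> CompletableFuture<T> timed(String topic, String subscription,
                                          LatencySink sink,
                                          Supplier<CompletableFuture<T>> operation) {
        final long startNanos = System.nanoTime();
        CompletableFuture<T> future;
        try {
            future = operation.get();
        } catch (Throwable t) {
            // Early failures (for example a fenced subscription) are still measured.
            future = CompletableFuture.failedFuture(t);
        }
        return future.whenComplete((value, error) -> {
            double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;
            // Assumed label values; the real metric may use different strings.
            String result = (error == null) ? "success" : "failure";
            sink.observe(topic, subscription, result, latencyMs);
        });
    }
}
```

Recording inside `whenComplete` keeps the measurement in one place, so a new failure path cannot silently skip the metric.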
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java | Unwraps completion exceptions in seek error handling and adds null-safe message formatting. |
| pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java | Introduces and records a Prometheus Summary metric for reset cursor latency (success/failure). |
| pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java | Updates the default managed-ledger read-entry timeout and expands its documentation. |
| deployment/terraform-ansible/templates/broker.conf | Updates the default managed-ledger read-entry timeout and documents 0 as disable. |
| conf/standalone.conf | Updates the default managed-ledger read-entry timeout and documents 0 as disable. |
| conf/broker.conf | Updates the default managed-ledger read-entry timeout and documents 0 as disable. |

This metric may leak high-cardinality label children.
Could we either avoid topic/subscription labels for this static summary, or record this through subscription-scoped stats with lifecycle cleanup? If keeping these labels is required, we should add cleanup on subscription/topic deletion that removes both the collector child and its associated summary logger.
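
To make the cardinality concern concrete, here is a small JDK-only sketch of a labeled registry in which every distinct (`topic`, `subscription`, `result`) tuple creates a child that lives until it is explicitly removed. `LabeledLatencyRegistry` and its methods are hypothetical stand-ins for a static labeled collector, not the Prometheus client or the broker's code; the point is only that deletion needs a matching cleanup hook.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;

// Hypothetical stand-in for a static labeled collector; illustration only.
final class LabeledLatencyRegistry {

    // One child per distinct label tuple; entries never expire on their own.
    private final Map<List<String>, DoubleAdder> children = new ConcurrentHashMap<>();

    void observe(String topic, String subscription, String result, double latencyMs) {
        children.computeIfAbsent(List.of(topic, subscription, result), k -> new DoubleAdder())
                .add(latencyMs);
    }

    /** Cleanup hook to call on subscription deletion: drops all result children. */
    void removeSubscription(String topic, String subscription) {
        children.keySet().removeIf(k -> k.get(0).equals(topic) && k.get(1).equals(subscription));
    }

    /** Cleanup hook to call on topic deletion: drops every child of the topic. */
    void removeTopic(String topic) {
        children.keySet().removeIf(k -> k.get(0).equals(topic));
    }

    int childCount() {
        return children.size();
    }
}
```

Without the two `remove*` hooks, the child count only grows as topics and subscriptions come and go, which is the leak described above.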
Motivation
Follow-up for investigating slow subscription cursor resets. When reset cursor is slow or fails, brokers currently lack a direct latency metric for the operation, and the binary protocol seek error path can dereference a missing cause while building the error response.
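
The null-cause problem can be sketched as follows. This is a minimal JDK-only example, assuming the error path receives a `Throwable` that may be a wrapper with no cause or may carry no message; `SeekErrors`, `unwrap`, and `safeMessage` are hypothetical names for this illustration and are not the helpers used in `ServerCnx`.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

// Hypothetical helper for illustration only; not the broker's actual code.
final class SeekErrors {

    /** Strips wrapper exceptions, stopping if a wrapper has no cause. */
    static Throwable unwrap(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Builds an error message without dereferencing a missing cause or message. */
    static String safeMessage(Throwable t) {
        if (t == null) {
            return "Unknown error";
        }
        Throwable root = unwrap(t);
        String message = root.getMessage();
        // Fall back to the exception type when no message is available.
        return message != null ? message : root.getClass().getName();
    }
}
```

Calling `e.getCause().getMessage()` directly would throw a `NullPointerException` for a wrapper without a cause, which is the failure mode this fallback avoids.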
Modifications
- Add `pulsar_subscription_reset_cursor_latency_ms` summary with `topic`, `subscription`, and `result` labels.
- Enable `managedLedgerReadEntryTimeoutSeconds` by setting the broker/standalone/template defaults to `120` seconds and documenting `0` as the disable value.

Verifying this change
- `./gradlew :pulsar-broker-common:compileJava :pulsar-broker:compileJava`
- `./gradlew :pulsar-broker:test --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testSeek --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testSeekByTimestamp --tests org.apache.pulsar.broker.service.SubscriptionSeekTest.testConcurrentResetCursor`