fix(tmq): modify the condition of checking vnode split finished #35273
wangmm0220 wants to merge 1 commit into main
Pull request overview
Updates TMQ test utilities to make vnode-split completion checks more strict/stable, addressing flaky split-finish detection in the TMQ vnode-split test suite.
Changes:
- Adds a consecutive-success counter when polling information_schema.ins_vnodes to confirm split completion.
- Increases the polling sleep interval and logs query results during polling.
    while True:
        tdSql.query("select * from information_schema.ins_vnodes where db_name='dbt' and status='leader' and restored=true")
        tdLog.info(tdSql.queryResult)
        if tdSql.getRows() == 2:
            cnt += 1
checkSplitVgroups() can loop forever if the expected vnode state is never reached (e.g., split fails or the environment differs), which can hang the entire test run. Please add a bounded timeout/retry limit and fail fast (e.g., tdLog.exit(...) with the last observed rows/result) similar to other cluster wait helpers that cap retries (e.g., ClusterComCheck.check_vgroups_status() uses count < count_number and exits on failure).
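One way to bound the loop, sketched under assumptions: the helper name `wait_for_consecutive`, its parameters, and the injected `check` and `sleep` callables are hypothetical and not part of the test framework. In the real helper, `check` would wrap the `tdSql.query(...)` call, and the caller would call `tdLog.exit(...)` when the function returns False.

```python
import time
from typing import Callable


def wait_for_consecutive(
    check: Callable[[], bool],
    required: int = 10,
    max_polls: int = 60,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it passes `required` times in a row.

    Returns True on success, and False once `max_polls` polls have been
    made without reaching the streak, so the caller can fail fast
    instead of hanging the test run.
    """
    streak = 0
    for _ in range(max_polls):
        if check():
            streak += 1
            if streak >= required:
                return True
        else:
            streak = 0  # any miss resets the streak
        sleep(interval)
    return False
```

The default values here are placeholders; the actual retry cap would need to be chosen to match how long a vnode split can legitimately take in CI.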
    tdSql.query("select * from information_schema.ins_vnodes where db_name='dbt' and status='leader' and restored=true")
    tdLog.info(tdSql.queryResult)
    if tdSql.getRows() == 2:
This loop logs tdSql.queryResult at INFO on every poll; in this repo, raw queryResult dumps are typically logged at DEBUG to avoid noisy CI logs (see test/new_test_framework/utils/clusterCommonCheck.py where tdLog.debug(tdSql.queryResult) is used). Consider switching this to tdLog.debug(...) and, since you only use the row count, querying select count(*) instead of select * to reduce output and query overhead.
Suggested change:
    - tdSql.query("select * from information_schema.ins_vnodes where db_name='dbt' and status='leader' and restored=true")
    - tdLog.info(tdSql.queryResult)
    - if tdSql.getRows() == 2:
    + tdSql.query("select count(*) from information_schema.ins_vnodes where db_name='dbt' and status='leader' and restored=true")
    + tdLog.debug(tdSql.queryResult)
    + if tdSql.queryResult[0][0] == 2:
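With count(*) the result set always contains exactly one row, so the row count no longer carries the answer and the value has to be read from the first cell, which is why the suggestion swaps `tdSql.getRows()` for `tdSql.queryResult[0][0]`. A small illustration of the difference, using Python's built-in `sqlite3` purely as a stand-in for TDengine; the `vnodes` table and its columns are made up for the demo and are not the real ins_vnodes schema:

```python
import sqlite3

# In-memory stand-in database with two leaders and one follower.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vnodes (db_name TEXT, status TEXT, restored INTEGER)")
conn.executemany(
    "INSERT INTO vnodes VALUES (?, ?, ?)",
    [("dbt", "leader", 1), ("dbt", "leader", 1), ("dbt", "follower", 1)],
)

where = "WHERE db_name = 'dbt' AND status = 'leader' AND restored = 1"

# SELECT *: one result row per matching vnode, so the row count is the answer.
star_rows = conn.execute(f"SELECT * FROM vnodes {where}").fetchall()

# SELECT count(*): always a single result row; the answer is its first cell.
count_rows = conn.execute(f"SELECT count(*) FROM vnodes {where}").fetchall()

leaders_via_star = len(star_rows)
leaders_via_count = count_rows[0][0]
```

Both variables end up holding the same number, but checking the row count of the count(*) query would always see 1, regardless of how many leaders exist.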
    if tdSql.getRows() == 2:
        cnt += 1
    else:
        cnt = 0
    if cnt >= 10:
Requiring tdSql.getRows() == 2 to be observed 10 consecutive times plus sleep(2) makes this wait at least ~20s after the condition first becomes true, and it will noticeably slow multiple TMQ vnode-split tests that call this helper. If the intent is to avoid transient states, consider waiting for a stability duration (e.g., condition true for N seconds) with a small poll interval, or using a smaller consecutive-count threshold, so successful splits don't always pay the extra 20s.
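A stability-duration wait could look roughly like the sketch below. Everything in it is an assumption for illustration: `wait_until_stable`, its parameters, and the default values are hypothetical, and `check` stands in for the query against information_schema.ins_vnodes. The clock and sleep functions are injectable so the logic can be exercised without real delays.

```python
import time
from typing import Callable, Optional


def wait_until_stable(
    check: Callable[[], bool],
    stable_for: float = 3.0,
    timeout: float = 120.0,
    interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `check` has held continuously for `stable_for` seconds.

    Returns False if `timeout` seconds elapse first.
    """
    deadline = clock() + timeout
    true_since: Optional[float] = None  # when the current run of passes began
    while True:
        now = clock()
        if check():
            if true_since is None:
                true_since = now
            if now - true_since >= stable_for:
                return True
        else:
            true_since = None  # a miss restarts the stability window
        if now >= deadline:
            return False
        sleep(interval)
```

With a short poll interval, the extra delay after a successful split is governed by `stable_for` rather than by the product of a fixed count and a long sleep, and the timeout keeps a failed split from hanging the run.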
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.