Summary
A ghost trace ID (tr_xhbp6c9p450r) was added to an episode's trace_ids_json but never existed in the traces table. This caused episodeRewardIsDirty to return true indefinitely, triggering 590 reward rescore calls over 5 days and wasting ~14.5 RMB on LLM calls.
Root Cause
1. Ghost trace creation
The episode was reopened. During the follow-up merge, appendTrace was called, which directly overwrites trace_ids_json without validating that the trace IDs actually exist in the traces table.
// episodes.js line 80-82
appendTrace(id, traceIds) {
  appendTrace.run({ id, trace_ids_json: toJsonText(traceIds) });
  // No validation that traceIds exist in traces table!
}
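One way to close this gap is to split the incoming IDs into known and ghost IDs before anything is written. The sketch below is a minimal illustration, not the project's implementation: only getManyByIds appears in this report, while splitKnownTraceIds, makeFakeTracesRepo, and the assumption that each trace row carries an id field are hypothetical.

```javascript
// Hypothetical helper: separate trace IDs that exist in the traces table
// from ghost IDs. Assumes getManyByIds is synchronous and returns rows
// shaped like { id: string }.
function splitKnownTraceIds(traceIds, tracesRepo) {
  const unique = [...new Set(traceIds)];
  const existing = new Set(tracesRepo.getManyByIds(unique).map((t) => t.id));
  return {
    known: unique.filter((id) => existing.has(id)),
    ghosts: unique.filter((id) => !existing.has(id)),
  };
}

// In-memory stand-in for the real traces repository (illustration only).
function makeFakeTracesRepo(ids) {
  const rows = new Map(ids.map((id) => [id, { id }]));
  return {
    getManyByIds: (wanted) =>
      wanted.filter((id) => rows.has(id)).map((id) => rows.get(id)),
  };
}

const repo = makeFakeTracesRepo(["tr_a", "tr_b"]);
const { known, ghosts } = splitKnownTraceIds(
  ["tr_a", "tr_b", "tr_xhbp6c9p450r"],
  repo
);
console.log(known);  // [ 'tr_a', 'tr_b' ]
console.log(ghosts); // [ 'tr_xhbp6c9p450r' ]
```

appendTrace could then persist only the known list and log or reject the ghosts. Which of the two is preferable depends on whether a trace can legitimately be written after its ID is appended.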
2. Reward traceCount mismatch
The reward scoring code loads traces via getManyByIds, which only returns traces that exist:
// reward.js line 48-52
const traces = traceIds.length > 0
  ? deps.tracesRepo.getManyByIds(traceIds).sort(...)
  : [];
This returns 11 traces (ghost excluded). After scoring: reward.traceCount = 11.
3. Dirty check compares against episode.traceIds.length
// memory-core.js line 1003-1006
const traceCount = reward.traceCount;
if (typeof traceCount === "number") {
  return traceCount !== (ep.traceIds?.length ?? 0);
  // 11 !== 12 → true → dirty forever!
}
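A more resilient comparison would count only the trace IDs that resolve to real rows. The following is a rough sketch under stated assumptions: ep.traceIds, reward.traceCount, and getManyByIds come from the excerpts in this report, while traceCountMismatch and the fake repository are hypothetical names used for illustration.

```javascript
// Hypothetical variant of the traceCount comparison: compare against the
// number of traces that actually exist, not the raw length of ep.traceIds.
function traceCountMismatch(ep, reward, tracesRepo) {
  const traceCount = reward?.traceCount;
  // Non-numeric counts are left to whatever other dirty checks exist.
  if (typeof traceCount !== "number") return false;
  const ids = ep.traceIds ?? [];
  const existingCount =
    ids.length > 0 ? tracesRepo.getManyByIds(ids).length : 0;
  return traceCount !== existingCount;
}

// Reproduce the reported shape: 11 real traces plus 1 ghost ID.
const existingIds = Array.from({ length: 11 }, (_, i) => `tr_${i}`);
const fakeRepo = {
  getManyByIds: (ids) =>
    ids.filter((id) => existingIds.includes(id)).map((id) => ({ id })),
};
const episode = {
  id: "ep_95n61b3jzycd",
  traceIds: [...existingIds, "tr_xhbp6c9p450r"],
};
const reward = { traceCount: 11 };

const oldCheck = reward.traceCount !== (episode.traceIds?.length ?? 0);
const newCheck = traceCountMismatch(episode, reward, fakeRepo);
console.log({ oldCheck, newCheck }); // { oldCheck: true, newCheck: false }
```

The trade-off is one extra repository lookup per dirty check, which is far cheaper than an LLM rescore.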
4. Infinite loop
The periodic rescore pass runs every 10 minutes, finds the episode dirty, rescores it, gets the same 11 vs 12 mismatch, and leaves the episode dirty. This repeated 590 times.
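A per-episode attempt cap would bound the cost even if some other bug keeps an episode dirty. The sketch below simulates the 590 ticks described above. makeRescoreGuard, tryAcquire, and the cap of 3 are assumptions made for illustration, not existing MemOS APIs or settings.

```javascript
// Hypothetical guard that limits rescore attempts per episode.
function makeRescoreGuard(maxAttempts) {
  const attempts = new Map(); // episodeId -> attempts used so far
  return {
    // Returns true and consumes one attempt if the episode may be rescored.
    tryAcquire(episodeId) {
      const used = attempts.get(episodeId) ?? 0;
      if (used >= maxAttempts) return false;
      attempts.set(episodeId, used + 1);
      return true;
    },
    // Call when the episode genuinely changes (e.g. a new trace is appended).
    reset(episodeId) {
      attempts.delete(episodeId);
    },
  };
}

const guard = makeRescoreGuard(3);
let llmCalls = 0;
for (let tick = 0; tick < 590; tick++) {
  const dirty = true; // the ghost trace keeps the episode dirty on every tick
  if (dirty && guard.tryAcquire("ep_95n61b3jzycd")) {
    llmCalls++; // stands in for one reward rescore call
  }
}
console.log(llmCalls); // 3
```

An in-memory counter resets on restart, so a real implementation would probably need to persist the attempt count alongside the episode.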
Evidence
Episode: ep_95n61b3jzycd
trace_ids_json count: 12 (including ghost tr_xhbp6c9p450r)
tr_xhbp6c9p450r in traces table: 0 (does not exist)
reward.traceCount: 11
episodeRewardIsDirty: true (traceCount mismatch)
Reward calls: 590 over 5 days (06-18 to 06-22)
Suggested Fixes
- appendTrace validation: Validate trace IDs exist before appending
- Dirty check resilience: Compare against actual existing trace count, not episode.traceIds.length
- Rescore retry limit: Add max retry count per episode to prevent infinite loops
- Ghost trace cleanup: Add startup scan to remove orphaned trace IDs from episodes
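The cleanup scan from the last item could look roughly like this. It is a sketch that only builds a repair plan and leaves persistence to the caller. findGhostTraceIds and the episode and trace shapes ({ id, traceIds } and { id }) are assumptions; only getManyByIds is taken from the code quoted in this report.

```javascript
// Hypothetical startup scan: list episodes whose traceIds reference traces
// that do not exist, together with the pruned ID list to write back.
function findGhostTraceIds(episodes, tracesRepo) {
  const plan = [];
  for (const ep of episodes) {
    const ids = ep.traceIds ?? [];
    if (ids.length === 0) continue;
    const existing = new Set(tracesRepo.getManyByIds(ids).map((t) => t.id));
    const removed = ids.filter((id) => !existing.has(id));
    if (removed.length > 0) {
      plan.push({
        id: ep.id,
        removed,
        traceIds: ids.filter((id) => existing.has(id)),
      });
    }
  }
  return plan;
}

// Stand-in repository in which only the ghost ID is missing.
const tracesRepo = {
  getManyByIds: (ids) =>
    ids.filter((id) => id !== "tr_xhbp6c9p450r").map((id) => ({ id })),
};
const plan = findGhostTraceIds(
  [
    { id: "ep_95n61b3jzycd", traceIds: ["tr_a", "tr_xhbp6c9p450r"] },
    { id: "ep_ok", traceIds: ["tr_b"] },
  ],
  tracesRepo
);
console.log(JSON.stringify(plan));
```

Returning a plan instead of mutating in place makes it possible to log the removed IDs before anything is overwritten.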
Related Issues
Environment
- MemOS: v2.0.20
- Agent: hermes
- OS: Windows 11