fix+perf: eliminate data races & replace all synchronization with Highway futex (zero mutex, wall -7%, sys -51%)#684

Open
KimBioInfoStudio wants to merge 18 commits into OpenGene:master from KimBioInfoStudio:refactor/cpp23-threading

Conversation


@KimBioInfoStudio KimBioInfoStudio commented Apr 17, 2026

Summary

This PR eliminates all data races found by ThreadSanitizer, then replaces every mutex, condition variable, usleep(), and sleep_for() busy-wait in fastp's threading pipeline with lock-free hwy::BlockUntilDifferent / hwy::WakeAll from Highway's cross-platform futex polyfill.

The result is a zero-mutex, zero-condvar threading architecture that is both faster and more correct than master.

Data Race Fixes (TSan verified)

| Commit | Fix |
| --- | --- |
| cf64a66 | Eliminate data race on `mNextSeq` in pwrite path |
| b32ea3a | Make SPSC queue `mHead` pointer atomic |
| 6316345 | Make ReadPool counters (`mProduced`/`mConsumed`) atomic |

Synchronization Replacements

| Location | Before | After |
| --- | --- | --- |
| SE/PE processor — reader backpressure | `yield()` spin | `BlockUntilDifferent` on pack counters |
| SE/PE processor — writer buffer full | `yield()` spin | `BlockUntilDifferent` on buffer length |
| WriterThread — `output()` wait for data | `usleep(100)` | `BlockUntilDifferent` on `mWriterNotify` counter |
| WriterThread — `input()` notify | (none) | `fetch_add` + `WakeAll` on `mWriterNotify` |
| WriterThread — pwrite ring ordering | `sleep_for(1µs)` spin | `BlockUntilDifferent` on `published_seq` |
| BgzfMtReader — slot lifecycle | mutex + condition_variable | Per-slot `atomic<uint32_t>` state + `mSlotNotify` counter |
| WriterThread — completion signal | (no wakeup) | `setInputCompleted()` bumps `mWriterNotify` + `WakeAll` |

Counter types: `atomic_long` / `atomic<size_t>` → `atomic<uint32_t>` (required by the Highway futex API).

All `<mutex>` and `<condition_variable>` includes removed from `peprocessor.h`, `seprocessor.h`, `writerthread.h`.

Key Design: mWriterNotify Pattern

Highway's `BlockUntilDifferent(prev, atom)` only returns once `atom != prev`; `WakeAll` alone cannot break the wait loop if the value is unchanged. This caused a deadlock when `mBufferLength` stayed 0 at completion time.

Fix: a separate mWriterNotify monotonic counter that gets incremented by both input() (new data) and setInputCompleted() (EOF). The value always changes, so BlockUntilDifferent always returns.
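The pattern above can be sketched in isolation. This is an illustrative model, not fastp's actual code: `blockUntilDifferent` is a spin-based stand-in for `hwy::BlockUntilDifferent` (which truly blocks in the kernel), the real `WakeAll` call becomes a no-op, and the buffer is reduced to a pending-item counter. The member names mirror the PR; `mPending`, `mWritten`, and `runWriterModel` are invented for the sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent: returns once the
// atomic's value differs from `prev` (the real call blocks via futex).
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

// Simplified model of the mWriterNotify pattern (names mirror the PR).
struct WriterModel {
    std::atomic<uint32_t> mWriterNotify{0};  // monotonic: always changes
    std::atomic<uint32_t> mPending{0};       // items waiting to be written
    std::atomic<bool> mInputCompleted{false};
    long mWritten = 0;                       // touched only by writer thread

    void input() {                           // producer: new data arrived
        mPending.fetch_add(1);
        mWriterNotify.fetch_add(1);          // counter always changes, so the
    }                                        // blocked writer always returns
    void setInputCompleted() {               // producer: EOF
        mInputCompleted.store(true);
        mWriterNotify.fetch_add(1);          // EOF also bumps the counter
    }
    void output() {                          // writer thread main loop
        for (;;) {
            uint32_t cur = mWriterNotify.load();  // snapshot FIRST
            uint32_t n = mPending.exchange(0);    // drain available work
            mWritten += n;
            if (n == 0) {
                if (mInputCompleted.load()) return;   // checked AFTER snapshot
                blockUntilDifferent(cur, mWriterNotify);
            }
        }
    }
};

long runWriterModel(int items) {
    WriterModel w;
    std::thread t(&WriterModel::output, &w);
    for (int i = 0; i < items; ++i) w.input();
    w.setInputCompleted();
    t.join();
    return w.mWritten;
}
```

Because the snapshot of `mWriterNotify` is taken before the completion check, a bump landing anywhere after the snapshot makes the wait return immediately, so no wakeup can be lost.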

Platform Support

Highway's futex.h provides native kernel-level blocking on all target platforms:

| Platform | Mechanism | Latency |
| --- | --- | --- |
| Linux | `SYS_futex` (kernel 2.6.22+) | ~µs, true block |
| macOS | `__ulock_wait` (10.12+) | ~µs, true block |
| FreeBSD | NanoSleep fallback | 2µs poll |

Compatible with C++11 and above. No new dependencies — fastp already links Highway.

Benchmark (5M PE reads, 150bp, gz→gz, -w 3)

| | wall | user | sys | page-faults |
| --- | --- | --- | --- | --- |
| master | 46.8s | 44.8s | 1.47s | 84K |
| refactor | 43.4s | 42.2s | 0.71s | 92K |
| improvement vs master | -7.2% | -5.9% | -51.4% | |

Correctness

Output MD5 matches master exactly for both R1 and R2 (gz→gz, PE mode). ✅

Supersedes #683 (all 3 race-condition fixes included).

Commits

| Commit | Description |
| --- | --- |
| cf64a66 | fix: eliminate data race on `mNextSeq` in pwrite path |
| b32ea3a | fix: make SPSC head pointer atomic |
| 6316345 | fix: make ReadPool counters atomic |
| fe5d99404452e5 | refactor: C++23 exploration → downgrade to C++11 |
| 5481356 | perf: replace `yield()` with Highway futex (SE/PE processors) |
| 11c0249 | fix: writer thread deadlock with `BlockUntilDifferent` |
| f5f4bca | perf: pwrite ring `sleep_for(1µs)` → Highway futex |
| 2f348b3 | perf: bgzf mutex+condvar → Highway futex |

Hermes and others added 10 commits April 13, 2026 10:29
mNextSeq was a plain size_t array written by worker threads in
inputPwrite() and read by setInputCompletedPwrite() with no
synchronization -- a C++ data race (undefined behaviour).

A stale read could produce a wrong lastSeq value, causing
ftruncate() to silently truncate the output file at the wrong
offset and drop the final gz member(s).

Fix: change mNextSeq to std::atomic<size_t>[].
- Worker threads write with memory_order_release after each pack,
  establishing a happens-before edge for the completion reader.
- setInputCompletedPwrite() opens with an acquire fence before
  reading with memory_order_relaxed, ensuring all prior worker
  writes are visible before the ftruncate() call.
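The release-store/acquire-fence pairing described in this commit can be sketched as below. This is an illustrative model, not fastp's code: `kWorkers`, `workerLoop`, `lastSeqAtCompletion`, and `runNextSeqDemo` are invented names, and in this standalone demo the `join()` already provides the needed happens-before, so the fence mirrors the commit's structure rather than carrying the whole burden.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

static const int kWorkers = 4;
std::atomic<size_t> mNextSeq[kWorkers];  // per-worker published sequence

void workerLoop(int tid, size_t packs) {
    for (size_t seq = 1; seq <= packs; ++seq) {
        // ... write pack data for `seq` here ...
        // release: publishes the pack writes together with the seq value
        mNextSeq[tid].store(seq, std::memory_order_release);
    }
}

size_t lastSeqAtCompletion() {
    // One acquire fence up front, then cheap relaxed loads per slot,
    // mirroring the commit's fence-then-relaxed-reads structure.
    std::atomic_thread_fence(std::memory_order_acquire);
    size_t last = 0;
    for (int i = 0; i < kWorkers; ++i) {
        size_t s = mNextSeq[i].load(std::memory_order_relaxed);
        if (s > last) last = s;
    }
    return last;
}

size_t runNextSeqDemo(size_t packs) {
    for (int i = 0; i < kWorkers; ++i)
        mNextSeq[i].store(0, std::memory_order_relaxed);
    std::vector<std::thread> ts;
    for (int i = 0; i < kWorkers; ++i) ts.emplace_back(workerLoop, i, packs);
    for (auto& t : ts) t.join();  // join() orders everything in this demo
    return lastSeqAtCompletion();
}
```

A stale, non-atomic read here is exactly what could hand `ftruncate()` a wrong `lastSeq` in the original code.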
The `head` pointer in SingleProducerSingleConsumerList was a plain
(non-atomic) pointer despite being written by the producer thread
(produce(), first-item branch) and read concurrently by the consumer
thread (canBeConsumed(), consume()).  ThreadSanitizer reported 15 data
races at singleproducersingleconsumerlist.h:100.

Fixes applied:
- `head` declared as `std::atomic<LockFreeListItem<T>*>` (tail stays
  non-atomic — producer-private after first item is published)
- Constructor: `head.store(NULL, relaxed)`
- produce() first-item branch:
    set tail = item first (producer-private write),
    then `head.store(item, release)` to publish atomically to consumer
    then `item->nextItemReady.store(true, release)` to signal readiness
- canBeConsumed():
    `head.load(acquire)` for NULL check (syncs with produce release),
    `head.load(relaxed)` for nextItemReady dereference (covered by
    the preceding acquire)
- consume():
    `head.load(acquire)` to read current head,
    `head.store(h->nextItem, release)` to advance — establishes
    happens-before with next canBeConsumed() acquire on head

Also fixes the else-branch nextItemReady assignment to use
`memory_order_release` (was implicit seq_cst, which does NOT prevent
compiler reordering of the preceding `tail->nextItem = item` write).
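The publication discipline this commit applies (payload writes first, then a release store that the consumer observes with an acquire load) is easiest to see in a compact form. The sketch below uses a bounded SPSC ring rather than fastp's linked list, so it is an analogy for the acquire/release pairing only; `SpscRing` and `spscSum` are invented names.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Bounded single-producer/single-consumer ring illustrating the same
// release(publish)/acquire(observe) pairing used for the SPSC head fix.
template <typename T, size_t N>
struct SpscRing {
    T buf[N];
    std::atomic<size_t> head{0};  // consumer index (written by consumer)
    std::atomic<size_t> tail{0};  // producer index (written by producer)

    bool produce(const T& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;
        buf[t % N] = v;                                // payload first...
        tail.store(t + 1, std::memory_order_release);  // ...then publish
        return true;
    }
    bool consume(T& out) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;
        out = buf[h % N];                              // acquire above makes
        head.store(h + 1, std::memory_order_release);  // the payload visible
        return true;
    }
};

long spscSum(long items) {
    SpscRing<long, 64> q;
    std::thread prod([&] {
        for (long i = 1; i <= items; ++i)
            while (!q.produce(i)) std::this_thread::yield();
    });
    long sum = 0, got = 0, v = 0;
    while (got < items) {
        if (q.consume(v)) { sum += v; ++got; }
        else std::this_thread::yield();
    }
    prod.join();
    return sum;
}
```

A plain (seq_cst) assignment on the ready flag, as the commit notes, is not the issue; the bug was the non-atomic `head` itself, which made every concurrent read undefined behaviour.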
ThreadSanitizer reported data races in ReadPool and SPSC when multiple
worker threads called ReadPool::input() concurrently:

  readpool.cpp:23  — mIsFull read vs. updateFullStatus() write
  readpool.cpp:27  — mProduced++ (non-atomic RMW) by multiple threads
  readpool.cpp:53  — mIsFull write vs. concurrent reads
  spsc.h:90        — size(): produced (producer-written) vs. consumed
                     (consumer-written) read without synchronization

Fixes in readpool.h:
  - mIsFull : bool  → std::atomic<bool>
  - mProduced : size_t → std::atomic<size_t>
  (atomic::operator++ and atomic::operator= are sufficient;
   no changes to readpool.cpp required)

Fixes in singleproducersingleconsumerlist.h:
  - produced, consumed : unsigned long → std::atomic<unsigned long>
  - size(): load both with memory_order_relaxed (approximate count used
    only as a soft back-pressure threshold)
  - produce(): produced.fetch_add(1, relaxed)
  - consume(): consumed.fetch_add(1, relaxed) with local snapshot for
    the (consumed & 0xFFF) recycle check
  - makeItem(): produced.load(relaxed) snapshot before >> and & ops
  - recycle(): consumed.load(relaxed) before >> op

After all four commits (mNextSeq, SPSC head, ReadPool/SPSC atomics),
ThreadSanitizer reports zero data races on 5k-read PE mode 8-thread
workload.
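The relaxed-counter approach for `size()` can be sketched as follows. This is an illustrative fragment, not fastp's ReadPool: `PoolCounters` and `hammer` are invented names, and the surrounding pool logic is omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic counters with relaxed RMWs are sufficient when size() is only a
// soft back-pressure hint, as the commit message describes.
struct PoolCounters {
    std::atomic<size_t> mProduced{0};
    std::atomic<size_t> mConsumed{0};

    void produce() { mProduced.fetch_add(1, std::memory_order_relaxed); }
    void consume() { mConsumed.fetch_add(1, std::memory_order_relaxed); }

    // Approximate: the two relaxed loads are not a consistent snapshot,
    // which is fine for a threshold check but wrong for exact accounting.
    size_t size() const {
        return mProduced.load(std::memory_order_relaxed) -
               mConsumed.load(std::memory_order_relaxed);
    }
};

size_t hammer(int threads, int perThread) {
    PoolCounters c;
    std::vector<std::thread> ts;
    for (int i = 0; i < threads; ++i)
        ts.emplace_back([&] {
            for (int j = 0; j < perThread; ++j) c.produce();
        });
    for (auto& t : ts) t.join();
    return c.size();  // exact only now: no concurrent updates remain
}
```

The key contrast with the pre-fix code: `mProduced++` on a plain `size_t` from multiple threads is a non-atomic read-modify-write and can drop increments; the relaxed atomic RMW cannot.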
- Use /home/kimy/build-env g++ (GCC 15.2.0, conda-forge)
- Upgrade -std=c++11 -> -std=c++23
- Default INCLUDE_DIRS and LIBRARY_DIRS to build-env paths

Enables: std::jthread, std::latch, std::counting_semaphore,
         atomic::wait/notify, std::println
@KimBioInfoStudio KimBioInfoStudio changed the title perf: replace yield() spin-loops with Highway futex (wall -15%, sys -82%) fix+perf: eliminate data races & replace yield() with Highway futex (wall -15%, sys -82%) Apr 17, 2026
Replace all std::this_thread::yield() busy-spin loops with
hwy::BlockUntilDifferent/WakeAll from Highway's futex.h polyfill.
This provides cross-platform kernel-level thread blocking (Linux futex,
macOS __ulock, FreeBSD NanoSleep fallback) instead of CPU-burning spins.

Changes:
- writerthread: output() waits on mBufferLength via BlockUntilDifferent,
  input() wakes writer via WakeAll after produce
- writerthread.h: add waitForBufferBelow() using BlockUntilDifferent loop
- peprocessor: replace 6 yield() sites with atomic wait/notify on
  mPackProducedCounter and mPackProcessedCounter
- seprocessor: same pattern as peprocessor for SE pipeline
- Change counter types from atomic_long to atomic<uint32_t> for
  Highway futex compatibility (uint32_t required by BlockUntilDifferent)

Benchmark (5M PE reads, gz→gz, -w 3):
  master:          56.5s wall, 8.0s sys, 2680K page-faults
  yield (before):  79.4s wall, 26.8s sys, 3278K page-faults
  futex (after):   47.8s wall, 1.4s sys, 120K page-faults

  wall -15%, sys -82%, page-faults -95% vs master
  Output md5 matches master (correctness verified)
@KimBioInfoStudio KimBioInfoStudio force-pushed the refactor/cpp23-threading branch from 439cdc0 to 2f33c85 on April 17, 2026 02:58
- std::jthread → std::thread + explicit join()
- std::latch → atomic<uint32_t> + hwy futex wait/wake
- std::println → cerr <<
- Remove #include <print>, <latch>; use <iostream>, <atomic>
- Makefile: -std=c++23 → -std=c++11

Preserves Highway futex performance (sys ~1.7s, page-faults ~150K).
Apple Clang on macOS CI does not support jthread/latch/println.
BlockUntilDifferent(prev, atom) only returns when atom != prev.
When mBufferLength stays 0, WakeAll cannot break the loop.

Fix: use separate mWriterNotify counter for writer thread blocking.
- input(): increments mWriterNotify + WakeAll to wake writer
- setInputCompleted(): increments mWriterNotify + WakeAll to wake writer
- output(): blocks on mWriterNotify instead of mBufferLength

Also move setInputCompleted() back to last worker thread (not main
thread), matching the original master pattern. This avoids a race
where main thread waits on latch while writer is already blocked.

Verified: SE smoke ✅, PE smoke ✅, PE benchmark wall -10%, sys -71%.
The pwrite ring buffer used std::this_thread::sleep_for(1µs) to poll
for the previous slot's published_seq. Replace with
hwy::BlockUntilDifferent + hwy::WakeAll for precise wakeup.

Changes:
- OffsetSlot::published_seq: atomic<size_t> → atomic<uint32_t>
  (seq values are small; uint32_t required by Highway futex API)
- Wait loop: sleep_for(1µs) → BlockUntilDifferent(cur, published_seq)
- Publish: store + WakeAll to notify waiting workers
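The wait-for-previous-slot ordering works like a ticket lock. The sketch below models it with the same spin-based stand-in for `hwy::BlockUntilDifferent` (the real call blocks in the kernel, and real code would also call `hwy::WakeAll` after publishing); `orderedAppend` and `runOrderedDemo` are invented names, and `published_seq` mirrors the PR.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

std::atomic<uint32_t> published_seq{0};

// Each worker waits until the previous sequence number is published,
// appends its record with exclusive access, then publishes its own.
void orderedAppend(uint32_t mySeq, std::vector<uint32_t>& log) {
    uint32_t cur;
    while ((cur = published_seq.load(std::memory_order_acquire)) != mySeq)
        blockUntilDifferent(cur, published_seq);  // woken on any change
    log.push_back(mySeq);                         // our turn: exclusive
    published_seq.store(mySeq + 1, std::memory_order_release);
    // real code: hwy::WakeAll(&published_seq);
}

bool runOrderedDemo(uint32_t n) {
    published_seq.store(0);
    std::vector<uint32_t> log;
    std::vector<std::thread> ts;
    for (uint32_t s = 0; s < n; ++s)
        ts.emplace_back(orderedAppend, s, std::ref(log));
    for (auto& t : ts) t.join();
    for (uint32_t s = 0; s < n; ++s)
        if (log[s] != s) return false;            // strict pwrite order
    return log.size() == n;
}
```

The release store on `published_seq` also carries each worker's data writes to the next worker's acquire load, which is what makes the hand-off safe without a mutex.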
Replace all std::mutex + std::condition_variable synchronization in
BgzfMtReader with lock-free Highway futex primitives:

- Per-slot atomic<uint32_t> state: BlockUntilDifferent waits for
  state transitions (FREE→COMPRESSED→DECOMPRESSING→READY→FREE)
- Global mSlotNotify counter: incremented on every state transition,
  used by decompressor threads to block when no COMPRESSED slots
  are available (replaces condvar broadcast)
- Remove <mutex>/<condition_variable> includes from pe/seprocessor.h
  and writerthread.h (no longer used anywhere in the pipeline)

This eliminates all kernel mutex contention in the BGZF decompression
pipeline, which is the hot path for .gz input files.
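The per-slot state machine can be reduced to a one-slot model for illustration. This sketch is not BgzfMtReader's code: it uses one slot and omits the `mSlotNotify` counter, `Slot`/`runSlotDemo` are invented names, the state set is trimmed to FREE→COMPRESSED→READY, and the spin helper again stands in for `hwy::BlockUntilDifferent`.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

enum SlotState : uint32_t { FREE = 0, COMPRESSED = 1, READY = 2 };

struct Slot {
    std::atomic<uint32_t> state{FREE};
    int payload = 0;  // protected by the state hand-off, not a mutex
};

void waitForState(Slot& s, uint32_t want) {
    uint32_t cur;
    while ((cur = s.state.load(std::memory_order_acquire)) != want)
        blockUntilDifferent(cur, s.state);
}

int runSlotDemo(int blocks) {
    Slot slot;
    long sum = 0;
    std::thread worker([&] {                          // "decompressor"
        for (int i = 0; i < blocks; ++i) {
            waitForState(slot, COMPRESSED);
            slot.payload *= 2;                        // "decompress"
            slot.state.store(READY, std::memory_order_release);
        }
    });
    for (int i = 1; i <= blocks; ++i) {               // "reader/consumer"
        waitForState(slot, FREE);
        slot.payload = i;                             // fill slot
        slot.state.store(COMPRESSED, std::memory_order_release);
        waitForState(slot, READY);
        sum += slot.payload;                          // consume result
        slot.state.store(FREE, std::memory_order_release);
    }
    worker.join();
    return (int)sum;
}
```

Each release store of the state publishes the payload writes to whichever thread next acquires that state, so the slot needs no lock even though two threads touch `payload`.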
@KimBioInfoStudio KimBioInfoStudio changed the title fix+perf: eliminate data races & replace yield() with Highway futex (wall -15%, sys -82%) fix+perf: eliminate data races & replace all synchronization with Highway futex (zero mutex, wall -7%, sys -51%) Apr 17, 2026

sfchen commented Apr 22, 2026

Hi @KimBioInfoStudio

I tested this PR, but it caused a deadlock when the input is a gz file, and it showed no performance enhancement against v1.3.2. Tested on a MacBook (M3 chip).

@KimBioInfoStudio
Member Author

@sfchen this is not for perf but for the deadlock fix; let me continue to triage the deadlock

KimBioInfoStudio and others added 2 commits April 23, 2026 20:30
Make mInputCompleted atomic with acquire/release ordering to fix a
race between producer and writer threads. Replace pwrite ring's
published_seq with a monotonic generation counter to prevent ABA on
slot reuse. Wake producers after buffer-length decrement so they
unblock promptly.

🐘 Generated with Crush

Co-Authored-By: Crush <[email protected]>
…up races

Three connected thread-synchronization bugs caused fastp to hang under
-w>=23 + plain (non-gz) output + --adapter_fasta.

1. Mid-flight deadlock: reader gated on mLeftWriter->waitForBufferBelow,
   writer drained mBufferLists in strict round-robin. When one worker
   ran slightly behind, its per-worker slot stayed empty while other
   slots piled up, pushing mBufferLength above the limit. The reader
   then halted at waitForBufferBelow, so the slow worker never received
   more input, its slot never filled, the writer stayed blocked, and
   every thread deadlocked. Confirmed by stack sample: 24 workers in
   peprocessor.cpp:1003, 2 readers in peprocessor.cpp:807, 2 writers in
   writerthread.cpp:110. Removed the writer-buffer backpressure — the
   pack-level backpressure (mLeftPackReadCounter - mPackProcessedCounter)
   already bounds in-flight memory without creating the cycle.

2. Reader-shutdown lost wakeup: readerTask/interleavedReaderTask/SE
   readerTask called setProducerFinished() without bumping
   mPackProducedCounter. A worker that had just snapshotted the counter
   in BlockUntilDifferent would miss the completion signal and sleep
   forever. Added a counter bump + WakeAll after setProducerFinished.

3. Writer-shutdown lost wakeup: WriterThread::output() checked
   mInputCompleted before snapshotting mWriterNotify. If setInputCompleted
   ran between the check and BlockUntilDifferent, cur captured the
   post-bump value and the writer blocked forever. Swapped the order so
   the snapshot is taken first; any subsequent bump is then guaranteed
   to make cur != current and return immediately.

Verified on macOS ARM64 with 10M simulated pairs, -w 24, plain fq,
--adapter_fasta: previously hung indefinitely, now completes in 38s.

🤖 Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
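The fix for bug 3 hinges entirely on ordering: snapshot the notify counter before checking the completion flag. A minimal model, with invented names (`notifyCtr`, `completed`, `waitForCompletion`, `runShutdownDemo`) and the usual spin stand-in for `hwy::BlockUntilDifferent`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

std::atomic<uint32_t> notifyCtr{0};
std::atomic<bool> completed{false};

// If step 2 came before step 1, a bump landing between them would be
// captured into the snapshot and the waiter could sleep forever.
void waitForCompletion() {
    for (;;) {
        uint32_t cur = notifyCtr.load(std::memory_order_acquire); // 1. snapshot
        if (completed.load(std::memory_order_acquire)) return;    // 2. check
        blockUntilDifferent(cur, notifyCtr);                      // 3. wait
    }
}

bool runShutdownDemo() {
    completed.store(false);
    notifyCtr.store(0);
    std::thread waiter(waitForCompletion);
    completed.store(true, std::memory_order_release);   // setInputCompleted()
    notifyCtr.fetch_add(1, std::memory_order_release);  // bump (+ WakeAll)
    waiter.join();                                      // returns promptly
    return true;
}
```

With snapshot-first ordering, any bump after the snapshot makes `notifyCtr != cur`, so step 3 cannot block past a completion signal.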