fix+perf: eliminate data races & replace all synchronization with Highway futex (zero mutex, wall -7%, sys -51%)#684

Open
KimBioInfoStudio wants to merge 18 commits into OpenGene:master from KimBioInfoStudio:refactor/cpp23-threading

Conversation


@KimBioInfoStudio KimBioInfoStudio commented Apr 17, 2026

Summary

This PR eliminates all data races found by ThreadSanitizer, then replaces every mutex, condition variable, usleep(), and sleep_for() busy-wait in fastp's threading pipeline with lock-free hwy::BlockUntilDifferent / hwy::WakeAll from Highway's cross-platform futex polyfill.

The result is a zero-mutex, zero-condvar threading architecture that is both faster and more correct than master.

Data Race Fixes (TSan verified)

| Commit | Fix |
| --- | --- |
| cf64a66 | Eliminate data race on `mNextSeq` in pwrite path |
| b32ea3a | Make SPSC queue `mHead` pointer atomic |
| 6316345 | Make ReadPool counters (`mProduced`/`mConsumed`) atomic |

Synchronization Replacements

| Location | Before | After |
| --- | --- | --- |
| SE/PE processor — reader backpressure | `yield()` spin | `BlockUntilDifferent` on pack counters |
| SE/PE processor — writer buffer full | `yield()` spin | `BlockUntilDifferent` on buffer length |
| WriterThread — `output()` wait for data | `usleep(100)` | `BlockUntilDifferent` on `mWriterNotify` counter |
| WriterThread — `input()` notify | (none) | `fetch_add` + `WakeAll` on `mWriterNotify` |
| WriterThread — pwrite ring ordering | `sleep_for(1µs)` spin | `BlockUntilDifferent` on `published_seq` |
| BgzfMtReader — slot lifecycle | mutex + condition_variable | Per-slot `atomic<uint32_t>` state + `mSlotNotify` counter |
| WriterThread — completion signal | (no wakeup) | `setInputCompleted()` bumps `mWriterNotify` + `WakeAll` |

Counter types: `atomic_long` / `atomic<size_t>` → `atomic<uint32_t>` (required by the Highway futex API).

All `<mutex>` and `<condition_variable>` includes removed from `peprocessor.h`, `seprocessor.h`, `writerthread.h`.

Key Design: mWriterNotify Pattern

Highway's `BlockUntilDifferent(prev, atom)` only returns once `atom != prev`; `WakeAll` alone cannot break the wait loop if the value is unchanged. This caused a deadlock when `mBufferLength` stayed 0 at completion time.

Fix: a separate mWriterNotify monotonic counter that gets incremented by both input() (new data) and setInputCompleted() (EOF). The value always changes, so BlockUntilDifferent always returns.
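The pattern above can be sketched in isolation. This is an illustrative model, not fastp's actual code: `blockUntilDifferent` is a spin-based stand-in for `hwy::BlockUntilDifferent` (which truly blocks in the kernel), the real `WakeAll` call becomes a no-op, and the buffer is reduced to a pending-item counter. The member names mirror the PR; `mPending`, `mWritten`, and `runWriterModel` are invented for the sketch.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent: returns once the
// atomic's value differs from `prev` (the real call blocks via futex).
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

// Simplified model of the mWriterNotify pattern (names mirror the PR).
struct WriterModel {
    std::atomic<uint32_t> mWriterNotify{0};  // monotonic: always changes
    std::atomic<uint32_t> mPending{0};       // items waiting to be written
    std::atomic<bool> mInputCompleted{false};
    long mWritten = 0;                       // touched only by writer thread

    void input() {                           // producer: new data arrived
        mPending.fetch_add(1);
        mWriterNotify.fetch_add(1);          // counter always changes, so the
    }                                        // blocked writer always returns
    void setInputCompleted() {               // producer: EOF
        mInputCompleted.store(true);
        mWriterNotify.fetch_add(1);          // EOF also bumps the counter
    }
    void output() {                          // writer thread main loop
        for (;;) {
            uint32_t cur = mWriterNotify.load();  // snapshot FIRST
            uint32_t n = mPending.exchange(0);    // drain available work
            mWritten += n;
            if (n == 0) {
                if (mInputCompleted.load()) return;   // checked AFTER snapshot
                blockUntilDifferent(cur, mWriterNotify);
            }
        }
    }
};

long runWriterModel(int items) {
    WriterModel w;
    std::thread t(&WriterModel::output, &w);
    for (int i = 0; i < items; ++i) w.input();
    w.setInputCompleted();
    t.join();
    return w.mWritten;
}
```

Because the snapshot of `mWriterNotify` is taken before the completion check, a bump landing anywhere after the snapshot makes the wait return immediately, so no wakeup can be lost.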

Platform Support

Highway's futex.h provides native kernel-level blocking on all target platforms:

| Platform | Mechanism | Latency |
| --- | --- | --- |
| Linux | `SYS_futex` (kernel 2.6.22+) | ~µs, true block |
| macOS | `__ulock_wait` (10.12+) | ~µs, true block |
| FreeBSD | NanoSleep fallback | 2µs poll |

Compatible with C++11 and above. No new dependencies — fastp already links Highway.

Benchmark (5M PE reads, 150bp, gz→gz, -w 3)

| | wall | user | sys | page-faults |
| --- | --- | --- | --- | --- |
| master | 46.8s | 44.8s | 1.47s | 84K |
| refactor | 43.4s | 42.2s | 0.71s | 92K |
| improvement vs master | -7.2% | -5.9% | -51.4% | |

Correctness

Output MD5 matches master exactly for both R1 and R2 (gz→gz, PE mode). ✅

Supersedes #683 (all 3 race-condition fixes included).

Commits

| Commit | Description |
| --- | --- |
| cf64a66 | fix: eliminate data race on `mNextSeq` in pwrite path |
| b32ea3a | fix: make SPSC head pointer atomic |
| 6316345 | fix: make ReadPool counters atomic |
| fe5d99404452e5 | refactor: C++23 exploration → downgrade to C++11 |
| 5481356 | perf: replace `yield()` with Highway futex (SE/PE processors) |
| 11c0249 | fix: writer thread deadlock with `BlockUntilDifferent` |
| f5f4bca | perf: pwrite ring `sleep_for(1µs)` → Highway futex |
| 2f348b3 | perf: bgzf mutex+condvar → Highway futex |

Hermes and others added 10 commits April 13, 2026 10:29
mNextSeq was a plain size_t array written by worker threads in
inputPwrite() and read by setInputCompletedPwrite() with no
synchronization -- a C++ data race (undefined behaviour).

A stale read could produce a wrong lastSeq value, causing
ftruncate() to silently truncate the output file at the wrong
offset and drop the final gz member(s).

Fix: change mNextSeq to std::atomic<size_t>[].
- Worker threads write with memory_order_release after each pack,
  establishing a happens-before edge for the completion reader.
- setInputCompletedPwrite() opens with an acquire fence before
  reading with memory_order_relaxed, ensuring all prior worker
  writes are visible before the ftruncate() call.
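The release-store/acquire-fence pairing described in this commit can be sketched as below. This is an illustrative model, not fastp's code: `kWorkers`, `workerLoop`, `lastSeqAtCompletion`, and `runNextSeqDemo` are invented names, and in this standalone demo the `join()` already provides the needed happens-before, so the fence mirrors the commit's structure rather than carrying the whole burden.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

static const int kWorkers = 4;
std::atomic<size_t> mNextSeq[kWorkers];  // per-worker published sequence

void workerLoop(int tid, size_t packs) {
    for (size_t seq = 1; seq <= packs; ++seq) {
        // ... write pack data for `seq` here ...
        // release: publishes the pack writes together with the seq value
        mNextSeq[tid].store(seq, std::memory_order_release);
    }
}

size_t lastSeqAtCompletion() {
    // One acquire fence up front, then cheap relaxed loads per slot,
    // mirroring the commit's fence-then-relaxed-reads structure.
    std::atomic_thread_fence(std::memory_order_acquire);
    size_t last = 0;
    for (int i = 0; i < kWorkers; ++i) {
        size_t s = mNextSeq[i].load(std::memory_order_relaxed);
        if (s > last) last = s;
    }
    return last;
}

size_t runNextSeqDemo(size_t packs) {
    for (int i = 0; i < kWorkers; ++i)
        mNextSeq[i].store(0, std::memory_order_relaxed);
    std::vector<std::thread> ts;
    for (int i = 0; i < kWorkers; ++i) ts.emplace_back(workerLoop, i, packs);
    for (auto& t : ts) t.join();  // join() orders everything in this demo
    return lastSeqAtCompletion();
}
```

A stale, non-atomic read here is exactly what could hand `ftruncate()` a wrong `lastSeq` in the original code.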
The `head` pointer in SingleProducerSingleConsumerList was a plain
(non-atomic) pointer despite being written by the producer thread
(produce(), first-item branch) and read concurrently by the consumer
thread (canBeConsumed(), consume()).  ThreadSanitizer reported 15 data
races at singleproducersingleconsumerlist.h:100.

Fixes applied:
- `head` declared as `std::atomic<LockFreeListItem<T>*>` (tail stays
  non-atomic — producer-private after first item is published)
- Constructor: `head.store(NULL, relaxed)`
- produce() first-item branch:
    set tail = item first (producer-private write),
    then `head.store(item, release)` to publish atomically to consumer
    then `item->nextItemReady.store(true, release)` to signal readiness
- canBeConsumed():
    `head.load(acquire)` for NULL check (syncs with produce release),
    `head.load(relaxed)` for nextItemReady dereference (covered by
    the preceding acquire)
- consume():
    `head.load(acquire)` to read current head,
    `head.store(h->nextItem, release)` to advance — establishes
    happens-before with next canBeConsumed() acquire on head

Also fixes the else-branch nextItemReady assignment to use
`memory_order_release` (was implicit seq_cst, which does NOT prevent
compiler reordering of the preceding `tail->nextItem = item` write).
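The publication discipline this commit applies (payload writes first, then a release store that the consumer observes with an acquire load) is easiest to see in a compact form. The sketch below uses a bounded SPSC ring rather than fastp's linked list, so it is an analogy for the acquire/release pairing only; `SpscRing` and `spscSum` are invented names.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

// Bounded single-producer/single-consumer ring illustrating the same
// release(publish)/acquire(observe) pairing used for the SPSC head fix.
template <typename T, size_t N>
struct SpscRing {
    T buf[N];
    std::atomic<size_t> head{0};  // consumer index (written by consumer)
    std::atomic<size_t> tail{0};  // producer index (written by producer)

    bool produce(const T& v) {
        size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;
        buf[t % N] = v;                                // payload first...
        tail.store(t + 1, std::memory_order_release);  // ...then publish
        return true;
    }
    bool consume(T& out) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;
        out = buf[h % N];                              // acquire above makes
        head.store(h + 1, std::memory_order_release);  // the payload visible
        return true;
    }
};

long spscSum(long items) {
    SpscRing<long, 64> q;
    std::thread prod([&] {
        for (long i = 1; i <= items; ++i)
            while (!q.produce(i)) std::this_thread::yield();
    });
    long sum = 0, got = 0, v = 0;
    while (got < items) {
        if (q.consume(v)) { sum += v; ++got; }
        else std::this_thread::yield();
    }
    prod.join();
    return sum;
}
```

A plain (seq_cst) assignment on the ready flag, as the commit notes, is not the issue; the bug was the non-atomic `head` itself, which made every concurrent read undefined behaviour.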
ThreadSanitizer reported data races in ReadPool and SPSC when multiple
worker threads called ReadPool::input() concurrently:

  readpool.cpp:23  — mIsFull read vs. updateFullStatus() write
  readpool.cpp:27  — mProduced++ (non-atomic RMW) by multiple threads
  readpool.cpp:53  — mIsFull write vs. concurrent reads
  spsc.h:90        — size(): produced (producer-written) vs. consumed
                     (consumer-written) read without synchronization

Fixes in readpool.h:
  - mIsFull : bool  → std::atomic<bool>
  - mProduced : size_t → std::atomic<size_t>
  (atomic::operator++ and atomic::operator= are sufficient;
   no changes to readpool.cpp required)

Fixes in singleproducersingleconsumerlist.h:
  - produced, consumed : unsigned long → std::atomic<unsigned long>
  - size(): load both with memory_order_relaxed (approximate count used
    only as a soft back-pressure threshold)
  - produce(): produced.fetch_add(1, relaxed)
  - consume(): consumed.fetch_add(1, relaxed) with local snapshot for
    the (consumed & 0xFFF) recycle check
  - makeItem(): produced.load(relaxed) snapshot before >> and & ops
  - recycle(): consumed.load(relaxed) before >> op

After all four commits (mNextSeq, SPSC head, ReadPool/SPSC atomics),
ThreadSanitizer reports zero data races on 5k-read PE mode 8-thread
workload.
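The relaxed-counter approach for `size()` can be sketched as follows. This is an illustrative fragment, not fastp's ReadPool: `PoolCounters` and `hammer` are invented names, and the surrounding pool logic is omitted.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic counters with relaxed RMWs are sufficient when size() is only a
// soft back-pressure hint, as the commit message describes.
struct PoolCounters {
    std::atomic<size_t> mProduced{0};
    std::atomic<size_t> mConsumed{0};

    void produce() { mProduced.fetch_add(1, std::memory_order_relaxed); }
    void consume() { mConsumed.fetch_add(1, std::memory_order_relaxed); }

    // Approximate: the two relaxed loads are not a consistent snapshot,
    // which is fine for a threshold check but wrong for exact accounting.
    size_t size() const {
        return mProduced.load(std::memory_order_relaxed) -
               mConsumed.load(std::memory_order_relaxed);
    }
};

size_t hammer(int threads, int perThread) {
    PoolCounters c;
    std::vector<std::thread> ts;
    for (int i = 0; i < threads; ++i)
        ts.emplace_back([&] {
            for (int j = 0; j < perThread; ++j) c.produce();
        });
    for (auto& t : ts) t.join();
    return c.size();  // exact only now: no concurrent updates remain
}
```

The key contrast with the pre-fix code: `mProduced++` on a plain `size_t` from multiple threads is a non-atomic read-modify-write and can drop increments; the relaxed atomic RMW cannot.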
- Use /home/kimy/build-env g++ (GCC 15.2.0, conda-forge)
- Upgrade -std=c++11 -> -std=c++23
- Default INCLUDE_DIRS and LIBRARY_DIRS to build-env paths

Enables: std::jthread, std::latch, std::counting_semaphore,
         atomic::wait/notify, std::println
@KimBioInfoStudio KimBioInfoStudio changed the title perf: replace yield() spin-loops with Highway futex (wall -15%, sys -82%) fix+perf: eliminate data races & replace yield() with Highway futex (wall -15%, sys -82%) Apr 17, 2026
Replace all std::this_thread::yield() busy-spin loops with
hwy::BlockUntilDifferent/WakeAll from Highway's futex.h polyfill.
This provides cross-platform kernel-level thread blocking (Linux futex,
macOS __ulock, FreeBSD NanoSleep fallback) instead of CPU-burning spins.

Changes:
- writerthread: output() waits on mBufferLength via BlockUntilDifferent,
  input() wakes writer via WakeAll after produce
- writerthread.h: add waitForBufferBelow() using BlockUntilDifferent loop
- peprocessor: replace 6 yield() sites with atomic wait/notify on
  mPackProducedCounter and mPackProcessedCounter
- seprocessor: same pattern as peprocessor for SE pipeline
- Change counter types from atomic_long to atomic<uint32_t> for
  Highway futex compatibility (uint32_t required by BlockUntilDifferent)

Benchmark (5M PE reads, gz→gz, -w 3):
  master:          56.5s wall, 8.0s sys, 2680K page-faults
  yield (before):  79.4s wall, 26.8s sys, 3278K page-faults
  futex (after):   47.8s wall, 1.4s sys, 120K page-faults

  wall -15%, sys -82%, page-faults -95% vs master
  Output md5 matches master (correctness verified)
@KimBioInfoStudio KimBioInfoStudio force-pushed the refactor/cpp23-threading branch from 439cdc0 to 2f33c85 on April 17, 2026 02:58
- std::jthread → std::thread + explicit join()
- std::latch → atomic<uint32_t> + hwy futex wait/wake
- std::println → cerr <<
- Remove #include <print>, <latch>; use <iostream>, <atomic>
- Makefile: -std=c++23 → -std=c++11

Preserves Highway futex performance (sys ~1.7s, page-faults ~150K).
Apple Clang on macOS CI does not support jthread/latch/println.
BlockUntilDifferent(prev, atom) only returns when atom != prev.
When mBufferLength stays 0, WakeAll cannot break the loop.

Fix: use separate mWriterNotify counter for writer thread blocking.
- input(): increments mWriterNotify + WakeAll to wake writer
- setInputCompleted(): increments mWriterNotify + WakeAll to wake writer
- output(): blocks on mWriterNotify instead of mBufferLength

Also move setInputCompleted() back to last worker thread (not main
thread), matching the original master pattern. This avoids a race
where main thread waits on latch while writer is already blocked.

Verified: SE smoke ✅, PE smoke ✅, PE benchmark wall -10%, sys -71%.
The pwrite ring buffer used std::this_thread::sleep_for(1µs) to poll
for the previous slot's published_seq. Replace with
hwy::BlockUntilDifferent + hwy::WakeAll for precise wakeup.

Changes:
- OffsetSlot::published_seq: atomic<size_t> → atomic<uint32_t>
  (seq values are small; uint32_t required by Highway futex API)
- Wait loop: sleep_for(1µs) → BlockUntilDifferent(cur, published_seq)
- Publish: store + WakeAll to notify waiting workers
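The wait-for-previous-slot ordering works like a ticket lock. The sketch below models it with the same spin-based stand-in for `hwy::BlockUntilDifferent` (the real call blocks in the kernel, and real code would also call `hwy::WakeAll` after publishing); `orderedAppend` and `runOrderedDemo` are invented names, and `published_seq` mirrors the PR.

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

std::atomic<uint32_t> published_seq{0};

// Each worker waits until the previous sequence number is published,
// appends its record with exclusive access, then publishes its own.
void orderedAppend(uint32_t mySeq, std::vector<uint32_t>& log) {
    uint32_t cur;
    while ((cur = published_seq.load(std::memory_order_acquire)) != mySeq)
        blockUntilDifferent(cur, published_seq);  // woken on any change
    log.push_back(mySeq);                         // our turn: exclusive
    published_seq.store(mySeq + 1, std::memory_order_release);
    // real code: hwy::WakeAll(&published_seq);
}

bool runOrderedDemo(uint32_t n) {
    published_seq.store(0);
    std::vector<uint32_t> log;
    std::vector<std::thread> ts;
    for (uint32_t s = 0; s < n; ++s)
        ts.emplace_back(orderedAppend, s, std::ref(log));
    for (auto& t : ts) t.join();
    for (uint32_t s = 0; s < n; ++s)
        if (log[s] != s) return false;            // strict pwrite order
    return log.size() == n;
}
```

The release store on `published_seq` also carries each worker's data writes to the next worker's acquire load, which is what makes the hand-off safe without a mutex.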
Replace all std::mutex + std::condition_variable synchronization in
BgzfMtReader with lock-free Highway futex primitives:

- Per-slot atomic<uint32_t> state: BlockUntilDifferent waits for
  state transitions (FREE→COMPRESSED→DECOMPRESSING→READY→FREE)
- Global mSlotNotify counter: incremented on every state transition,
  used by decompressor threads to block when no COMPRESSED slots
  are available (replaces condvar broadcast)
- Remove <mutex>/<condition_variable> includes from pe/seprocessor.h
  and writerthread.h (no longer used anywhere in the pipeline)

This eliminates all kernel mutex contention in the BGZF decompression
pipeline, which is the hot path for .gz input files.
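The per-slot state machine can be reduced to a one-slot model for illustration. This sketch is not BgzfMtReader's code: it uses one slot and omits the `mSlotNotify` counter, `Slot`/`runSlotDemo` are invented names, the state set is trimmed to FREE→COMPRESSED→READY, and the spin helper again stands in for `hwy::BlockUntilDifferent`.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

enum SlotState : uint32_t { FREE = 0, COMPRESSED = 1, READY = 2 };

struct Slot {
    std::atomic<uint32_t> state{FREE};
    int payload = 0;  // protected by the state hand-off, not a mutex
};

void waitForState(Slot& s, uint32_t want) {
    uint32_t cur;
    while ((cur = s.state.load(std::memory_order_acquire)) != want)
        blockUntilDifferent(cur, s.state);
}

int runSlotDemo(int blocks) {
    Slot slot;
    long sum = 0;
    std::thread worker([&] {                          // "decompressor"
        for (int i = 0; i < blocks; ++i) {
            waitForState(slot, COMPRESSED);
            slot.payload *= 2;                        // "decompress"
            slot.state.store(READY, std::memory_order_release);
        }
    });
    for (int i = 1; i <= blocks; ++i) {               // "reader/consumer"
        waitForState(slot, FREE);
        slot.payload = i;                             // fill slot
        slot.state.store(COMPRESSED, std::memory_order_release);
        waitForState(slot, READY);
        sum += slot.payload;                          // consume result
        slot.state.store(FREE, std::memory_order_release);
    }
    worker.join();
    return (int)sum;
}
```

Each release store of the state publishes the payload writes to whichever thread next acquires that state, so the slot needs no lock even though two threads touch `payload`.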
@KimBioInfoStudio KimBioInfoStudio changed the title fix+perf: eliminate data races & replace yield() with Highway futex (wall -15%, sys -82%) fix+perf: eliminate data races & replace all synchronization with Highway futex (zero mutex, wall -7%, sys -51%) Apr 17, 2026

sfchen commented Apr 22, 2026

Hi @KimBioInfoStudio

I tested this PR, but it caused a deadlock when the input is a gz file, and it showed no performance enhancement against v1.3.2. Tested on a MacBook (M3 chip).

@KimBioInfoStudio
Member Author

@sfchen this is not for perf but for the deadlock fix; let me continue to triage the deadlock

KimBioInfoStudio and others added 2 commits April 23, 2026 20:30
Make mInputCompleted atomic with acquire/release ordering to fix a
race between producer and writer threads. Replace pwrite ring's
published_seq with a monotonic generation counter to prevent ABA on
slot reuse. Wake producers after buffer-length decrement so they
unblock promptly.

🐘 Generated with Crush

Co-Authored-By: Crush <[email protected]>
…up races

Three connected thread-synchronization bugs caused fastp to hang under
-w>=23 + plain (non-gz) output + --adapter_fasta.

1. Mid-flight deadlock: reader gated on mLeftWriter->waitForBufferBelow,
   writer drained mBufferLists in strict round-robin. When one worker
   ran slightly behind, its per-worker slot stayed empty while other
   slots piled up, pushing mBufferLength above the limit. The reader
   then halted at waitForBufferBelow, so the slow worker never received
   more input, its slot never filled, the writer stayed blocked, and
   every thread deadlocked. Confirmed by stack sample: 24 workers in
   peprocessor.cpp:1003, 2 readers in peprocessor.cpp:807, 2 writers in
   writerthread.cpp:110. Removed the writer-buffer backpressure — the
   pack-level backpressure (mLeftPackReadCounter - mPackProcessedCounter)
   already bounds in-flight memory without creating the cycle.

2. Reader-shutdown lost wakeup: readerTask/interleavedReaderTask/SE
   readerTask called setProducerFinished() without bumping
   mPackProducedCounter. A worker that had just snapshotted the counter
   in BlockUntilDifferent would miss the completion signal and sleep
   forever. Added a counter bump + WakeAll after setProducerFinished.

3. Writer-shutdown lost wakeup: WriterThread::output() checked
   mInputCompleted before snapshotting mWriterNotify. If setInputCompleted
   ran between the check and BlockUntilDifferent, cur captured the
   post-bump value and the writer blocked forever. Swapped the order so
   the snapshot is taken first; any subsequent bump is then guaranteed
   to make cur != current and return immediately.

Verified on macOS ARM64 with 10M simulated pairs, -w 24, plain fq,
--adapter_fasta: previously hung indefinitely, now completes in 38s.

🤖 Generated with Claude Code
Co-Authored-By: Claude <[email protected]>
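The fix for bug 3 hinges entirely on ordering: snapshot the notify counter before checking the completion flag. A minimal model, with invented names (`notifyCtr`, `completed`, `waitForCompletion`, `runShutdownDemo`) and the usual spin stand-in for `hwy::BlockUntilDifferent`:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Spin-based stand-in for hwy::BlockUntilDifferent.
static void blockUntilDifferent(uint32_t prev, std::atomic<uint32_t>& a) {
    while (a.load(std::memory_order_acquire) == prev)
        std::this_thread::yield();
}

std::atomic<uint32_t> notifyCtr{0};
std::atomic<bool> completed{false};

// If step 2 came before step 1, a bump landing between them would be
// captured into the snapshot and the waiter could sleep forever.
void waitForCompletion() {
    for (;;) {
        uint32_t cur = notifyCtr.load(std::memory_order_acquire); // 1. snapshot
        if (completed.load(std::memory_order_acquire)) return;    // 2. check
        blockUntilDifferent(cur, notifyCtr);                      // 3. wait
    }
}

bool runShutdownDemo() {
    completed.store(false);
    notifyCtr.store(0);
    std::thread waiter(waitForCompletion);
    completed.store(true, std::memory_order_release);   // setInputCompleted()
    notifyCtr.fetch_add(1, std::memory_order_release);  // bump (+ WakeAll)
    waiter.join();                                      // returns promptly
    return true;
}
```

With snapshot-first ordering, any bump after the snapshot makes `notifyCtr != cur`, so step 3 cannot block past a completion signal.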