
👌 fix quadratic complexity in fragments_join #389

Open
petricevich wants to merge 1 commit into executablebooks:master from petricevich:fragments_join_worst_case_n_squared_fix

Conversation

@petricevich

When emphasis/strikethrough postprocessing leaves a long run of adjacent text tokens (e.g. lots of intraword `_` characters that can't open or close emphasis), the old code merged them pairwise:

```python
state.tokens[curr + 1].content = state.tokens[curr].content + state.tokens[curr + 1].content
```

That's quadratic in the size of the run because every step rebuilds the growing prefix. Switched it to collect the run into a list and "".join once into the last token, which keeps the existing semantics (last token of the run is the one preserved, level is unchanged inside a run because text tokens have nesting=0).
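The two strategies can be sketched side by side with a minimal stand-in for markdown-it-py's token objects (`SimpleToken` here is a hypothetical simplification; real tokens carry more attributes such as `markup` and `level`):

```python
# Hypothetical minimal stand-in for markdown-it-py's Token class,
# used only to illustrate the complexity difference.
class SimpleToken:
    def __init__(self, type_, content):
        self.type = type_
        self.content = content

def join_run_quadratic(tokens):
    """Old behavior: pairwise concatenation, O(n^2) in run length."""
    for curr in range(len(tokens) - 1):
        # Each step copies the whole accumulated prefix again.
        tokens[curr + 1].content = tokens[curr].content + tokens[curr + 1].content
        tokens[curr].content = ""
    return tokens[-1]

def join_run_linear(tokens):
    """New behavior: collect the run into a list and join once."""
    parts = [tok.content for tok in tokens]
    for tok in tokens[:-1]:
        tok.content = ""
    # The last token of the run is the one preserved, so its
    # non-content attributes survive unchanged.
    tokens[-1].content = "".join(parts)
    return tokens[-1]
```

Both versions keep the last token of the run, so the observable result is identical; only the number of character copies differs.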

Tested on an adversarial ~190 KB document with ~30k intraword underscores on a single line. With tracemalloc running:

|        | render time | peak Python alloc |
|--------|-------------|-------------------|
| before | 2.2 s       | 4476 MB           |
| after  | 0.6 s       | 23 MB             |
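The asymptotic gap can also be illustrated without tracemalloc by counting characters copied (a simplified cost model assuming n single-character fragments, not the actual benchmark above):

```python
def chars_copied_pairwise(n):
    # Pairwise a + b rebuilds the growing prefix at every step,
    # copying 2 + 3 + ... + n characters for n one-char fragments.
    total = 0
    prefix = 0
    for _ in range(n):
        prefix += 1          # one new fragment of length 1 arrives
        if prefix > 1:
            total += prefix  # cost of rebuilding the concatenation
    return total

def chars_copied_join(n):
    # "".join copies each character exactly once.
    return n

# For ~30k fragments the pairwise strategy copies on the order of
# 450 million characters, while a single join copies ~30k.
```

This mirrors the measured numbers: the quadratic copying dominates both runtime and peak allocation once runs grow into the tens of thousands.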

It's not just a contrived attack input: this kind of thing also shows up naturally in markdown produced by OCR pipelines, where tables of identifiers and references can easily contain very long runs of underscores or other delimiter characters.

Existing tests still pass.

When emphasis/strikethrough postprocessing leaves a long run of adjacent
text tokens (e.g. unmatched intraword `_` delimiters), fragments_join
merged them via pairwise `a + b` concatenation. Each step rebuilds the
growing prefix, costing O(L*k) for a run of k tokens with total content length L.

Walk the whole run once, collect content into a list, and "".join into
the last token, making the work O(L). The kept token is still the last
in the run so its non-content attributes (markup, etc.) are preserved.
@codecov

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.81%. Comparing base (8933147) to head (f142dc8).

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##           master     #389      +/-   ##
==========================================
+ Coverage   95.80%   95.81%   +0.01%
==========================================
  Files          64       64
  Lines        3457     3467      +10
==========================================
+ Hits         3312     3322      +10
  Misses        145      145
```
| Flag    | Coverage Δ                     |
|---------|--------------------------------|
| pytests | 95.81% <100.00%> (+0.01%) ⬆️   |

Flags with carried forward coverage won't be shown.


@chrisjsewell
Member

Thanks, will double check soon, but sounds good in principle
