Batch node creation to avoid oversized Bolt transactions #318
Merged
Summary
Mirror the edge-batching pattern on `create_nodes`: chunk `nodeList` in Python (`batch_size=5000`) and send one `execute_write` per chunk, with progress logging.

Motivation
In production we observed the Neo4j driver hitting:
mid-write while a graph job was persisting nodes for a large repo. Root cause: `create_nodes` passes the entire `nodeList` in a single `execute_write` call. For large repos this is a multi-MB UNWIND payload that the server takes long enough to process that AuraDB / network middleboxes close the underlying connection. The next driver call surfaces as `defunct connection`, and the routing-table refresh that follows fails too.

`create_edges` already batches at 10,000 per `execute_write`. `create_nodes` did not: same wire-payload problem, just on the nodes side.

Behavior
- Before: one transaction carrying `len(nodeList)` items.
- After: ceil(`len(nodeList)` / 5000) transactions, each carrying ≤ 5,000 nodes. The inner `apoc.periodic.iterate` batch size stays at 1,000 (unchanged), so server-side processing semantics are identical.
- Progress logging follows the `create_edges` style:
  - `Creating N nodes in batches of 5000`
  - `Processing nodes batch X/Y (i/N)`

Why 5,000 instead of edges' 10,000: node payloads carry `code_text` and full attribute maps, so each item is materially bigger than an edge tuple. 5K keeps the per-batch wire payload comparable to the edges path.

Note on dev
There is an open PR #317 (dev → main). dev is ~704 commits behind main (last sync 2025-02-28) and that promotion is unsafe as-is: it would delete ~87K lines, including main's test suite. This PR goes directly to main to avoid that path. The earlier PR #316 already landed the equivalent change on `dev`'s legacy file path (`blarify/db_managers/neo4j_manager.py`), so the fix is captured there too if `dev` is ever rebuilt from main.

Test plan
- `poetry run ruff check blarify/repositories/graph_db_manager/neo4j_manager.py`: clean.
- `poetry run pyright …/neo4j_manager.py`: no new errors (one pre-existing override-mismatch on line 218 is out of scope).
- Run `create_graph` against a large repo and watch for `Creating N nodes in batches of 5000` / per-batch lines in logs; confirm no `defunct connection` / routing errors mid-write.
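
For reviewers who want the shape of the change without opening the diff, the chunking described under Behavior can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `create_nodes_in_batches` and the `write_batch` callback are hypothetical names, with `write_batch` standing in for whatever issues one `execute_write` per chunk. The meaning of `i` in the per-batch log line is assumed here to be the running count of nodes sent so far.

```python
import logging
import math
from typing import Any, Callable, Dict, List

logger = logging.getLogger(__name__)


def create_nodes_in_batches(
    node_list: List[Dict[str, Any]],
    write_batch: Callable[[List[Dict[str, Any]]], None],
    batch_size: int = 5000,
) -> int:
    """Send node_list in chunks of at most batch_size; return the batch count.

    write_batch is a hypothetical stand-in for one execute_write call.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    total = len(node_list)
    total_batches = math.ceil(total / batch_size)
    logger.info("Creating %d nodes in batches of %d", total, batch_size)

    for batch_index, start in enumerate(range(0, total, batch_size), start=1):
        chunk = node_list[start:start + batch_size]
        # Assumption: the last pair in the log line is nodes sent so far / total.
        logger.info(
            "Processing nodes batch %d/%d (%d/%d)",
            batch_index,
            total_batches,
            start + len(chunk),
            total,
        )
        write_batch(chunk)  # one transaction per chunk

    return total_batches
```

With 12,001 nodes and the default batch size this issues three writes of 5,000, 5,000, and 2,001 nodes; an empty list issues none. In the real manager the callback would presumably wrap a session's `execute_write`, but that wiring is omitted here.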