Default low_memory=False when reading CSVs with the C parser #322
Merged
pandas' default low_memory=True reads a CSV in chunks and infers each
column's dtype per chunk.
For mostly-null or mixed-type metadata columns this is non-deterministic
across runs and emits a DtypeWarning ("Columns (N) have mixed types"),
so the same file can load as object on one machine and float64 on another.
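To illustrate why per-chunk inference can disagree with whole-column inference, here is a toy model in plain Python. It is not pandas' parser: `infer_dtype` and `chunk_dtypes` are hypothetical helpers, the chunk size is made up, and the rule "empty or numeric means float64, anything else means object" is a deliberate simplification.

```python
def infer_dtype(cells):
    """Toy stand-in for dtype inference over raw CSV cells (strings).

    Assumption: empty cells are nulls; a run of nulls/numbers is float64,
    and any non-numeric cell makes the result object.
    """
    non_null = [c for c in cells if c != ""]
    try:
        for c in non_null:
            float(c)
    except ValueError:
        return "object"
    return "float64"


def chunk_dtypes(cells, chunk_size):
    """Infer a dtype separately for each consecutive chunk of the column."""
    return [
        infer_dtype(cells[i:i + chunk_size])
        for i in range(0, len(cells), chunk_size)
    ]


# A mostly-null metadata column with a single late string value.
column = [""] * 6 + ["scenario-a", ""]

print(chunk_dtypes(column, 4))  # ['float64', 'object'] -> chunks disagree
print(infer_dtype(column))      # 'object' -> one pass sees the whole column
```

In this model the chunked result depends on where the chunk boundary falls relative to the first non-null value, whereas the single pass always sees the same data.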
Default low_memory=False so dtype inference is a single, deterministic,
quiet pass.
The option is only valid for the C parser, so it is left untouched for the
python and pyarrow engines (which reject it), and an explicit caller value
still wins.
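A minimal sketch of the defaulting rule described above, written against a plain kwargs dict. This is not the actual `_read_pandas` code: `apply_low_memory_default` is a hypothetical name, and the sketch assumes that an unspecified engine means the C parser.

```python
def apply_low_memory_default(read_csv_kwargs):
    """Return a copy of the kwargs with low_memory=False filled in
    when the C parser is in use and the caller did not set it."""
    kwargs = dict(read_csv_kwargs)  # do not mutate the caller's dict
    # Assumption: a missing or None engine means the C parser.
    engine = kwargs.get("engine") or "c"
    if engine == "c":
        # setdefault leaves an explicit caller value untouched.
        kwargs.setdefault("low_memory", False)
    return kwargs


print(apply_low_memory_default({}))                    # {'low_memory': False}
print(apply_low_memory_default({"low_memory": True}))  # {'low_memory': True}
print(apply_low_memory_default({"engine": "python"}))  # {'engine': 'python'}
```

The resulting dict would then be forwarded as `pandas.read_csv(path, **kwargs)`. Passing `low_memory=True` explicitly is how a caller opts back into the chunked path.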
Description
ScmRun(<csv>) reads via pandas.read_csv with pandas' default low_memory=True, which infers each column's dtype per chunk. For mostly-null or mixed-type metadata columns this is non-deterministic across runs/machines/pandas versions and emits DtypeWarning: Columns (N) have mixed types. The same file can load as object on one machine and float64 on another, which breaks downstream equality and regression checks.
This defaults low_memory=False in _read_pandas so dtype inference is a single, deterministic, quiet pass. It is scoped to the C parser only, because the python and pyarrow engines reject low_memory (they raise ValueError), and an explicit caller-provided low_memory still wins.
Trade-off
low_memory=False raises peak memory during the parse, because the C parser buffers the whole file's tokenised data to infer each column's dtype in one pass instead of chunk-by-chunk. Measured on a 233 MB wide timeseries CSV (60k rows × 206 cols, isolated process, peak RSS): 436 MB with low_memory=True vs 995 MB with low_memory=False. That is roughly 2.3× the peak, an increase of 559 MB or ≈2.4× the file size, while the resulting DataFrame is identical (114 MB) either way. The cost is confined to the CSV read path (low_memory only affects read_csv) and scales with file size, so it is negligible for small CSVs; anyone reading a very large CSV can pass low_memory=True explicitly to opt back into the chunked path.
Please confirm that this pull request has done the following: