Default low_memory=False when reading CSVs with the C parser #322

Merged
lewisjared merged 2 commits into main from fix/csv-low-memory-default on Jun 18, 2026

lewisjared (Collaborator) commented Jun 18, 2026

Description

ScmRun(<csv>) reads via pandas.read_csv with pandas' default low_memory=True, which infers each column's dtype per chunk. For mostly-null or mixed-type metadata columns this is non-deterministic across runs/machines/pandas versions and emits DtypeWarning: Columns (N) have mixed types — the same file can load as object on one machine and float64 on another, which breaks downstream equality and regression checks.

This defaults low_memory=False in _read_pandas so dtype inference is a single, deterministic, quiet pass. It is scoped to the C parser only — the python and pyarrow engines reject low_memory (raise ValueError) — and an explicit caller-provided low_memory still wins.
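
A minimal sketch of the defaulting rule described above, not the actual `_read_pandas` code. The helper name `apply_low_memory_default` is hypothetical, and the sketch assumes that a missing `engine` means the C parser; it ignores the cases where pandas itself falls back to the python engine.

```python
from typing import Any


def apply_low_memory_default(read_csv_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the read_csv kwargs with low_memory=False defaulted.

    Hypothetical helper for illustration only.
    """
    kwargs = dict(read_csv_kwargs)  # do not mutate the caller's dict

    # Assumption: no explicit engine means the C parser will be used.
    engine = kwargs.get("engine") or "c"

    if engine == "c":
        # setdefault leaves an explicit caller-provided low_memory untouched.
        kwargs.setdefault("low_memory", False)

    # For the python and pyarrow engines the option is never added,
    # because those engines reject it.
    return kwargs
```

Using `setdefault` rather than plain assignment is what lets an explicit `low_memory=True` from the caller win.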

Trade-off

low_memory=False raises peak memory during the parse, because the C parser buffers the whole file's tokenised data to infer each column's dtype in one pass instead of chunk-by-chunk. Measured on a 233 MB wide timeseries CSV (60k rows × 206 cols, isolated process, peak RSS): 436 MB with low_memory=True vs 995 MB with low_memory=False. That is roughly 2.3× the peak, an extra 559 MB (≈2.4× the file size), while the resulting DataFrame is identical (114 MB) either way. The cost is confined to the CSV read path (low_memory only affects read_csv) and scales with file size, so it is negligible for small CSVs. Anyone reading a very large CSV can pass low_memory=True explicitly to opt back into the chunked path.
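
For reference, the ratios can be recomputed from the measurements quoted above. This is plain arithmetic on those three figures; the variable names are illustrative only.

```python
# Measurements quoted in the trade-off paragraph, in MB.
file_size = 233
peak_chunked = 436  # peak RSS with low_memory=True
peak_single = 995   # peak RSS with low_memory=False

peak_ratio = peak_single / peak_chunked   # how much higher the peak is
extra_peak = peak_single - peak_chunked   # additional memory, in MB
extra_vs_file = extra_peak / file_size    # additional memory relative to file size
peak_vs_file = peak_single / file_size    # total peak relative to file size

print(f"peak ratio:            {peak_ratio:.2f}x")
print(f"extra peak:            {extra_peak} MB")
print(f"extra peak / file:     {extra_vs_file:.2f}x")
print(f"total peak / file:     {peak_vs_file:.2f}x")
```

The ≈2.4× figure matches the additional 559 MB relative to the file size; the total peak with low_memory=False is closer to 4.3× the file size.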

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Example added (either to an existing notebook or as a new notebook, where applicable)
  • Changelog in '/changelog' added

pandas' default low_memory=True reads a CSV in chunks and infers each
column's dtype per chunk.
For mostly-null or mixed-type metadata columns this is non-deterministic
across runs and emits a DtypeWarning ("Columns (N) have mixed types"),
so the same file can load as object on one machine and float64 on another.

Default low_memory=False so dtype inference is a single, deterministic,
quiet pass.
The option is only valid for the C parser, so it is left untouched for the
python and pyarrow engines (which reject it), and an explicit caller value
still wins.
@lewisjared lewisjared merged commit d1552df into main Jun 18, 2026
12 checks passed
@lewisjared lewisjared deleted the fix/csv-low-memory-default branch June 18, 2026 23:46