Default low_memory=False when reading CSVs with the C parser by lewisjared · Pull Request #322 · openscm/scmdata

lewisjared · 2026-06-18T23:22:28Z

Description

ScmRun(<csv>) reads via pandas.read_csv with pandas' default low_memory=True, which infers each column's dtype per chunk. For mostly-null or mixed-type metadata columns this is non-deterministic across runs/machines/pandas versions and emits DtypeWarning: Columns (N) have mixed types — the same file can load as object on one machine and float64 on another, which breaks downstream equality and regression checks.

This defaults low_memory=False in _read_pandas so dtype inference is a single, deterministic, quiet pass. It is scoped to the C parser only — the python and pyarrow engines reject low_memory (raise ValueError) — and an explicit caller-provided low_memory still wins.
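For reviewers, the defaulting rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_read_pandas` code: the helper name `with_low_memory_default` is hypothetical, and it assumes the engine is decided only by an explicit `engine` keyword (it ignores the cases where pandas itself falls back from the C parser to the python engine).

```python
def with_low_memory_default(read_csv_kwargs):
    """Return a copy of the read_csv kwargs with low_memory=False defaulted.

    Hypothetical helper for illustration; not part of scmdata or pandas.
    """
    kwargs = dict(read_csv_kwargs)  # do not mutate the caller's dict

    # engine=None means pandas chooses its default, which is the C parser.
    engine = kwargs.get("engine")

    if engine in (None, "c"):
        # setdefault leaves an explicit caller-provided low_memory untouched.
        kwargs.setdefault("low_memory", False)

    # For "python" and "pyarrow" the option is not added at all.
    return kwargs


print(with_low_memory_default({}))
print(with_low_memory_default({"engine": "pyarrow"}))
print(with_low_memory_default({"low_memory": True}))
```

The resulting dict would then be forwarded as `pandas.read_csv(path, **kwargs)`. Using `setdefault` rather than plain assignment is what lets an explicit caller value win.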

Trade-off

low_memory=False raises peak memory during the parse, because the C parser buffers the whole file's tokenised data to infer each column's dtype in one pass instead of chunk-by-chunk. Measured on a 233 MB wide timeseries CSV (60k rows × 206 cols, isolated process, peak RSS): 436 MB with low_memory=True vs 995 MB with low_memory=False. That is roughly 2.3× the peak, and the extra ≈559 MB is ≈2.4× the file size, while the resulting DataFrame is identical (114 MB) either way. The cost is confined to the CSV read path (low_memory only affects read_csv) and scales with file size, so it is negligible for small CSVs; anyone reading a very large CSV can pass low_memory=True explicitly to opt back into the chunked path.
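To make the ratios easy to check, here is the arithmetic on the measurements quoted above. The variable names are illustrative only; the three input numbers are the ones reported in this description, and nothing here is a new measurement.

```python
# Figures reported in the trade-off paragraph (all in MB).
file_mb = 233
peak_low_memory_true_mb = 436
peak_low_memory_false_mb = 995

# How much higher the peak is with low_memory=False.
peak_ratio = peak_low_memory_false_mb / peak_low_memory_true_mb

# Additional peak memory attributable to the single-pass parse.
extra_mb = peak_low_memory_false_mb - peak_low_memory_true_mb

# The additional memory, and the full peak, relative to the file size.
extra_vs_file = extra_mb / file_mb
peak_vs_file = peak_low_memory_false_mb / file_mb

print(f"peak ratio:        {peak_ratio:.2f}x")
print(f"extra memory:      {extra_mb} MB")
print(f"extra / file size: {extra_vs_file:.2f}x")
print(f"peak / file size:  {peak_vs_file:.2f}x")
```

On these numbers the ≈2.4× figure matches the additional memory relative to the file size; the full low_memory=False peak is closer to 4.3× the file size.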

Please confirm that this pull request has done the following:

Tests added
Documentation added (where applicable)
Example added (either to an existing notebook or as a new notebook, where applicable)
Changelog in '/changelog' added

pandas' default low_memory=True reads a CSV in chunks and infers each column's dtype per chunk. For mostly-null or mixed-type metadata columns this is non-deterministic across runs and emits a DtypeWarning ("Columns (N) have mixed types"), so the same file can load as object on one machine and float64 on another. Default low_memory=False so dtype inference is a single, deterministic, quiet pass. The option is only valid for the C parser, so it is left untouched for the python and pyarrow engines (which reject it), and an explicit caller value still wins.

lewisjared added 2 commits June 19, 2026 09:21

Add changelog fragment for PR #322

258fd3f

lewisjared merged commit d1552df into main Jun 18, 2026
12 checks passed

lewisjared deleted the fix/csv-low-memory-default branch June 18, 2026 23:46

lewisjared merged 2 commits into main from fix/csv-low-memory-default
