
parquet/compress: enable WithAllLitEntropyCompression(true) for zstd#779

Merged
zeroshade merged 1 commit into apache:main from varun0630:upstream/klauspost-alllit-entropy
Apr 24, 2026
Conversation

@varun0630 (Contributor) commented Apr 23, 2026

Rationale

The klauspost/compress/zstd encoder currently disables AllLitEntropyCompression at SpeedDefault (the preset that maps to zstd levels 1–4). Klauspost's encoder short-circuits to storing literals uncompressed when no LZ matches are found, skipping the entropy-coding stage. This is a good tradeoff for genuinely incompressible data (random bytes), but it leaves significant compression on the table for real-world columnar data where LZ match density is low but byte distributions are highly skewed — e.g. parquet INT32 decimal columns whose values cluster in a small range (so the high bytes are mostly zero).

Enabling WithAllLitEntropyCompression(true) forces entropy coding on literals even without LZ matches, matching the behavior of the C reference implementation (facebook/zstd) at the same nominal levels.

Impact

Measured on a real-world parquet workload — TPC-DS store_sales, 7 Trino-written files, ~9.5M rows, 23 columns including high-cardinality Decimal(7,2) columns — going through Apache Iceberg's compaction path at ZSTD level 3:

| Config | Output vs input |
| --- | --- |
| klauspost (current default) | +6.11% inflation |
| klauspost + WithAllLitEntropyCompression(true) | -1.84% reduction |
| DataDog/zstd (CGo wrapper around C zstd), level 3 | -2.23% reduction |
| Trino (JNI, C zstd level 3), reference | -3.99% reduction |

Per-blob benchmark (161 page blobs compressed directly by both implementations at level 3):

  • klauspost current default: 346,287 KB (66.60% of raw)
  • klauspost + this fix: 329,249 KB (63.32% of raw)
  • DataDog/zstd: 329,648 KB (63.40% of raw)

With this one-line change, klauspost matches (and slightly beats) the C reference implementation on this workload.

Discussing with @klauspost we concluded that enabling AllLitEntropyCompression is the intended way to close this gap. This PR applies that setting to arrow-go's zstd codec.

Trade-off

Slightly slower compression on genuinely incompressible data (the case AllLitEntropyCompression was disabled for). For parquet workloads, this is typically a non-issue since columns with no structure are rare.
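The distinction between "no LZ matches" and "incompressible" is the crux of the trade-off. A short stdlib-only sketch makes it concrete: the zeroth-order byte entropy of a clustered INT32 column (illustrative data, not the PR's benchmark) sits well below 8 bits/byte, so entropy-coding its literals pays off even when individual values rarely repeat, whereas uniform random bytes sit at ~8 bits/byte and gain nothing:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"math/rand"
)

// shannonEntropy returns the zeroth-order Shannon entropy of data,
// in bits per byte (0 for constant data, ~8 for uniform random bytes).
func shannonEntropy(data []byte) float64 {
	var counts [256]int
	for _, b := range data {
		counts[b]++
	}
	n := float64(len(data))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	rng := rand.New(rand.NewSource(1))

	// Little-endian INT32 values clustered in [0, 50000): the two high
	// bytes are almost always zero, skewing the byte distribution.
	clustered := make([]byte, 0, 4*100_000)
	var b [4]byte
	for i := 0; i < 100_000; i++ {
		binary.LittleEndian.PutUint32(b[:], uint32(rng.Intn(50_000)))
		clustered = append(clustered, b[:]...)
	}

	// Uniformly random bytes: genuinely incompressible.
	uniform := make([]byte, len(clustered))
	rng.Read(uniform)

	fmt.Printf("clustered INT32 column: %.2f bits/byte\n", shannonEntropy(clustered))
	fmt.Printf("uniform random bytes:   %.2f bits/byte\n", shannonEntropy(uniform))
}
```

Only the random-bytes case pays the (small) cost of attempting entropy coding that will be rejected; the clustered case is exactly where the old short-circuit was leaving compression on the table.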

klauspost's zstd encoder disables entropy compression on literals by
default at SpeedDefault, skipping it when no LZ matches are found.
This is an optimization for incompressible data, but it leaves
significant compression on the table for parquet column data which
often has skewed byte distributions even without obvious LZ matches
(e.g. high bytes of INT32 decimal values clustered around zero).

Enabling WithAllLitEntropyCompression(true) closes the gap vs C zstd
at the same nominal level. On TPC-DS store_sales (9.5M rows, 23 cols),
this takes the compaction output from +6.11% inflation to -1.84%
reduction — essentially matching C zstd's -2.23% without CGo.
@varun0630 varun0630 requested a review from zeroshade as a code owner April 23, 2026 23:48
@varun0630 (Contributor, Author)

cc @klauspost — this is the arrow-go side of the discussion we had about WithAllLitEntropyCompression(true). Would appreciate your sign-off on the rationale.

@klauspost

Yes. A bit of background: in most cases, no LZ matches means the input is just random data. In those cases, on the fastest and default settings, the data is stored raw without checking whether it could be entropy coded.

Speed impact will be minimal. Incompressible data is rejected at ~5 GB/s with the current implementation. Compressible data is encoded at somewhere between 500 and 900 MB/s, but of course it also benefits from the compression.

@zeroshade zeroshade merged commit 7b3d772 into apache:main Apr 24, 2026
23 checks passed
