parquet/compress: enable WithAllLitEntropyCompression(true) for zstd#779
Merged
zeroshade merged 1 commit into apache:main on Apr 24, 2026
Conversation
klauspost's zstd encoder disables entropy compression on literals by default at SpeedDefault, skipping it when no LZ matches are found. This is an optimization for incompressible data, but it leaves significant compression on the table for parquet column data which often has skewed byte distributions even without obvious LZ matches (e.g. high bytes of INT32 decimal values clustered around zero). Enabling WithAllLitEntropyCompression(true) closes the gap vs C zstd at the same nominal level. On TPC-DS store_sales (9.5M rows, 23 cols), this takes the compaction output from +6.11% inflation to -1.84% reduction — essentially matching C zstd's -2.23% without CGo.
Contributor
Author
cc @klauspost — this is the arrow-go side of the discussion we had about
Yes. A bit of background: in most cases, no LZ matches means the input is just random data. In those cases the data is stored as raw bytes directly on the fastest and default settings, without checking whether it can be entropy coded. The speed impact will be minimal: incompressible data is rejected at ~5 GB/s with the current implementation, while compressible data is processed somewhere between 500–900 MB/s, but of course it also benefits from the compression.
Rationale
The klauspost/compress/zstd encoder currently disables AllLitEntropyCompression at SpeedDefault (the preset that maps to zstd levels 1–4). Klauspost's encoder short-circuits to storing literals uncompressed when no LZ matches are found, skipping the entropy-coding stage. This is a good trade-off for genuinely incompressible data (random bytes), but it leaves significant compression on the table for real-world columnar data where LZ match density is low but byte distributions are highly skewed: for example, parquet INT32 decimal columns whose values cluster in a small range, so the high bytes are mostly zero.

Enabling WithAllLitEntropyCompression(true) forces entropy coding on literals even without LZ matches, matching the behavior of the C reference implementation (facebook/zstd) at the same nominal levels.

Impact
Measured on a real-world parquet workload (TPC-DS store_sales: 7 Trino-written files, ~9.5M rows, 23 columns including high-cardinality Decimal(7,2) columns) going through Apache Iceberg's compaction path at ZSTD level 3.

Per-blob benchmark (161 page blobs compressed directly by both implementations at level 3):
With this one-line change, klauspost matches (and slightly beats) the C reference implementation on this workload.
Discussing with @klauspost, we concluded that enabling AllLitEntropyCompression is the intended way to close this gap. This PR applies that setting to arrow-go's zstd codec.

Trade-off
Slightly slower compression on genuinely incompressible data (the case AllLitEntropyCompression was disabled for). For parquet workloads this is typically a non-issue, since columns with no structure are rare.