feat: add MapSort expression support for Spark 4.0 #4076

andygrove wants to merge 10 commits into apache:main
Conversation
Add native map_sort scalar function that sorts map entries by key in ascending order, and wire it up via the Spark 4.0 CometExprShim so that MapSort expressions are accelerated instead of falling back to Spark. Re-enable all CometColumnarShuffleSuite map tests that were skipped for Spark 4.0.

Closes apache#1941

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
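The sorting semantics can be illustrated with a minimal std-only Rust sketch. The real kernel operates on Arrow `MapArray` columns; the `map_sort` function and the `Vec<(K, V)>` representation here are hypothetical stand-ins for illustration only.

```rust
// Hypothetical sketch of MapSort semantics: entries are reordered by key
// in ascending order, and each value travels with its key. The actual
// implementation works on Arrow MapArray data, not Vec<(K, V)>.
fn map_sort<K: Ord, V>(mut entries: Vec<(K, V)>) -> Vec<(K, V)> {
    // Compare by key only. Spark map keys are unique, so ties cannot occur.
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}

fn main() {
    let m = vec![(3, "c"), (1, "a"), (2, "b")];
    let sorted = map_sort(m);
    assert_eq!(sorted, vec![(1, "a"), (2, "b"), (3, "c")]);
    println!("{:?}", sorted);
}
```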
Spark 4.0 normalizes shuffle keys containing array<map> via transform(arr, x -> mapsort(x)), which Comet does not yet support because ArrayTransform with a lambda body has no serde. Mark the columnar shuffle on map array element test as expecting the fallback on Spark 4.0+ while still verifying answer correctness.
The MapSort serde for Spark 4.0 called scalarFunctionExprToProto without a return type. The Rust planner then looked up "map_sort" in the session UDF registry to infer the type, but map_sort is only handled via the create_comet_physical_fun match dispatch, not registered as a UDF, causing "There is no UDF named 'map_sort' in the registry" at execution time (e.g., group-by on a map column in CollationSuite). Pass ms.dataType explicitly via scalarFunctionExprToProtoWithReturnType, matching the pattern used by ceil, floor, and other scalar functions.
Arrow's sort_to_indices does not support Struct (and other complex) key types, so map_sort fails at runtime when the map key is a struct. Check key type via supportedScalarSortElementType and fall back to Spark when the key type is not natively sortable. This fixes 4 CollationSuite failures in spark-sql-auto-sql_core-1 for Spark 4.0: 'Group by on map containing structs with ...'.
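The gate described above can be sketched as a simple type check: nested key types are rejected so the expression falls back to Spark. This is a std-only illustration; the `KeyType` enum and `is_natively_sortable` are hypothetical stand-ins for the Scala-side `supportedScalarSortElementType` check, not the actual API.

```rust
// Hypothetical mirror of the supportedScalarSortElementType gate:
// Arrow's sort_to_indices handles primitive and string keys, but not
// nested (struct/list/map) keys, so those cases fall back to Spark.
#[allow(dead_code)]
#[derive(Debug)]
enum KeyType { Int, Long, Float, Double, Utf8, Struct, List, Map }

fn is_natively_sortable(t: &KeyType) -> bool {
    match t {
        // Nested key types are not supported by the native sort kernel.
        KeyType::Struct | KeyType::List | KeyType::Map => false,
        // Scalar key types sort natively.
        _ => true,
    }
}

fn main() {
    assert!(is_natively_sortable(&KeyType::Utf8));
    assert!(!is_natively_sortable(&KeyType::Struct));
    println!("ok");
}
```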
# Conflicts: # spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala
Spark 4.0 wraps map shuffle keys in mapsort(...). Comet's map_sort relies on Arrow's sort_to_indices, which only supports scalar key types, so maps with array or struct keys fall back to Spark. Update the 'columnar shuffle on array/struct map key/value' test to expect 0 Comet shuffles for the array-key and struct-key cases on Spark 4.0+, while keeping the scalar-key cases at 1.
```scala
.repartition(numPartitions, $"_1", $"_2")
  .sortWithinPartitions($"_2")

if (isSpark40Plus) {
```
Nice, glad we're cleaning up this ugliness.
Thanks @andygrove! Categorized suggestions below:

Performance: One global
- Use single global sort_to_indices+take instead of per-map take+concat
- Add early-out fast paths (empty array, all-null, is_sorted=true)
- Fall back to Spark for floating-point map keys when strictFloatingPoint=true
- Clean up Arc::clone calls and replace .unwrap() on downcasts with .expect
- Document MapSort behavior in map expressions compatibility guide
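The first two bullets can be sketched in std-only Rust: a map batch is a flat key buffer plus offsets delimiting each map's slice, so instead of sorting and re-concatenating each map separately, one permutation of indices is built (sorted within each segment) and applied with a single take. The function name and the `Vec`-based layout are assumptions for illustration; the real code operates on Arrow buffers.

```rust
// Sketch of "one global sort + take per batch" with an early-out:
// `keys` is the flattened key buffer; `offsets` marks each map's slice,
// e.g. offsets = [0, 3, 5] means map 0 owns keys[0..3] and map 1 owns keys[3..5].
fn segment_sorted_indices(keys: &[i32], offsets: &[usize]) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..keys.len()).collect();
    // Early out: if every segment is already sorted, return the identity
    // permutation without doing any sorting work.
    let already_sorted = offsets
        .windows(2)
        .all(|w| keys[w[0]..w[1]].windows(2).all(|p| p[0] <= p[1]));
    if already_sorted {
        return perm;
    }
    // Sort index ranges segment by segment; the caller then applies the
    // whole permutation with a single take over the key and value buffers.
    for w in offsets.windows(2) {
        perm[w[0]..w[1]].sort_by_key(|&i| keys[i]);
    }
    perm
}

fn main() {
    // Two maps: keys [3, 1, 2] and [5, 4].
    let keys = [3, 1, 2, 5, 4];
    let offsets = [0, 3, 5];
    let perm = segment_sorted_indices(&keys, &offsets);
    assert_eq!(perm, vec![1, 2, 0, 4, 3]);
    // A single take yields all maps sorted in place.
    let taken: Vec<i32> = perm.iter().map(|&i| keys[i]).collect();
    assert_eq!(taken, vec![1, 2, 3, 4, 5]);
    println!("{:?}", taken);
}
```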
Thanks @mbutrovich. Pushed 1620e33b7 addressing the feedback:

Performance
Spark compatibility
Style
Docs

I'll add benchmarks next.
Covers int and string keys at three map sizes (4, 16, 64 entries per map) with a fixed batch of 8192 maps.
Microbenchmark numbers from
The biggest wins are the small-map cases where the per-map struct copies and trailing
mbutrovich left a comment
Looks awesome! Thanks for the quick turnaround on the feedback, @andygrove!
```scala
case _ => None
}

case ms: MapSort =>
```
Is this shim only for 4.0, or does it need to be in 4.x?
Yeah, the PR was started before 4.1 was added, and upmerging broke the tests. Added shims for 4.1 and 4.2 now.
There is also a folder spark/src/main/spark-4.x that serves all Spark 4 subversions.
The MapSort handler was only added to the spark-4.0 shim, so under the spark-4.1 and spark-4.2 profiles `mapsort(...)` partitioning expressions fell through to `case _ => None` and the columnar shuffle reverted to plain Spark, breaking CometColumnarShuffleSuite.
Which issue does this PR close?
Closes #1941
Closes #3171
Rationale for this change
Spark 4.0 introduces MapSort, used for normalizing map values when they appear in shuffle hash partitioning keys, in try_element_at, and in other contexts where map ordering must be deterministic. Without native support, queries that touch maps in any of these positions fall back to Spark, which forces the entire enclosing operator off Comet (e.g. an entire shuffle exchange).

What changes are included in this PR?

- A native map_sort scalar function in native/spark-expr/src/map_funcs/map_sort.rs that sorts map entries by key in ascending order, registered via comet_scalar_funcs.rs.
- Wire MapSort into the Spark 4.0 CometExprShim so the expression is converted to the new scalar function during serde.
- The columnar shuffle on map array element test in CometColumnarShuffleSuite now expects shuffle fallback on Spark 4.0+: the new shuffle-key normalization wraps mapsort inside transform(arr, x -> mapsort(x)), and Comet does not currently support ArrayTransform with a lambda body. Answer correctness is still verified via checkSparkAnswer.

How are these changes tested?

- Tests in native/spark-expr/src/map_funcs/map_sort.rs cover sorting on each supported key type, null handling, and empty maps.
- The CometColumnarShuffleSuite tests for map shuffle keys all pass under the Spark 4.0 profile (41/41).