Skip to content

Broadcast hash join and sort-merge join can produce incorrect results on non-default collated keys [Spark 4] #4051

@andygrove

Description

@andygrove

Describe the bug

Follow-up from #4035, which rejects non-default collated strings in shuffle, sort, and aggregate but does not cover the join paths.

CometHashJoin.doConvert (used by both CometHashJoinExec and CometBroadcastHashJoinExec) and CometSortMergeJoinExec.supportedSortMergeJoinEqualType both accept any StringType as a join key without checking for non-default collation. Comet's native join performs byte-level key equality, so rows that compare equal under a collation (e.g., 'a' vs 'A' under utf8_lcase) will not match.

For sort-merge join this is masked today by the shuffle-side fallback added in #4035: the shuffle on a collated key falls back to Spark, which forces the whole stage to Spark. Broadcast hash join has no shuffle to protect it, so incorrect results can escape.

Steps to reproduce

With Spark 4.0.1:

SET spark.comet.exec.broadcastHashJoin.enabled = true;

CREATE OR REPLACE TEMP VIEW l AS SELECT * FROM (VALUES ('a'), ('B')) AS t(c);
CREATE OR REPLACE TEMP VIEW r AS SELECT * FROM (VALUES ('A'), ('b')) AS t(c);

SELECT l.c, r.c
FROM l JOIN r
ON l.c COLLATE utf8_lcase = r.c COLLATE utf8_lcase;

Spark returns two matching rows; Comet (with broadcast hash join enabled) returns zero because the byte-level equality in the native join does not match 'a' with 'A'.

Expected behavior

Comet should fall back to Spark for hash joins and sort-merge joins whose equality keys have a non-default collation, either by rejecting the key types in supportedSortMergeJoinEqualType / a new equivalent check in CometHashJoin.doConvert, or by funneling through the existing isStringCollationType predicate on CometTypeShim.

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcorrectnesspriority:criticalData corruption, silent wrong results, security issuesspark 4

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions