Parallel Hashing and Sorting#

Hash Join#

We first distribute the workload with \(h_n\) and apply the divide phase. Then we continue with the next phase locally.

  • Two tuples that need to join always end up at the same machine

Sort-Merge Join#

  • Partition data by range at pass 0

    • To prevent data skew, we must determine the distribution. This can be done efficiently with random sampling.

  • Remaining pass are local.

Symmetric Join#

We can add symmetry to either hash or sort join to allow for a paralleled pipeline that’s always streaming. We do this by doing a local merge on both \(R\) and \(S\) as the input comes.