Join Operators#
A single I/O in the operator model includes:
Accessing the page or record (whichever highest)
Exclude:
Predicate comparison
Syntax#
\([L]\): Number of pages to store \(R\)
\(p_R\): Number of records per page of \(R\)
\(|R|\): The number of records in \(R\)
\(|R| = p_R [L]\)
Theta Join#
Simple Nested Loops Join
for r in R.records: for s in S.records: <pred>
Consider joining two table with records labeled \(R\) and \(S\) respectively with an on clause. The cost is,
\[ [L] + |R|[R] \]Order of Join
Take note of the order you join. It may be cheaper doing \(S\) join \(R\) than \(R\) join \(S\).
Page Nested Loop Join
for r in R.pages: for s in S.pages: for records_r in r: for records_s in s: <pred> <pred>
Instead of scanning each record of \(R\) to a page of \(S\) we yse page of \(R\)
\[ [L] + [L][R] \]Chunk Nested Loop Join
We will chunk the records into \(B-2\) pages.
\[ [L] + \left\lceil{\frac{[L]}{B-2}}\right\rceil \cdot [R] \]Index Nested Loop Join
Unclustered $\( [L] + |R| * (\texttt{Search} + \texttt{HeapAccess}) \)$
Clustered $\( [L] + |R| * \texttt{Search} + \texttt{HeapAccess} \cdot \texttt{distinct_vals(R)} \)$
Sort-Merge Join#
A sort-merge join is a join using lexographical ranges which has an equality join with less cost than purely doing a cartesian join.
if (not mark):
while (r<s): r++
while (r>s): s++
mark = s
if (r==s):
result = (r,s)
s++
return result
else:
s = mark
r++
mark = NULL
The cost using \(B = \sqrt{\max{[L],[R]}}\) is,
Improvement#
We can increase the buffer to \(B = \sqrt{[L]} + \sqrt{[R]}\) and first sort both \(R\) and \(S\) simultaneously then at the last pass merge while finding pairs. The cost for this is,
Hash Join#
We wish to join two tables \(R\) and \(S\) by comparing their hash.
Naive Hash Join#
Assuming \(B < [L]\), we can create a big enough buffer to compare the hashes between \(R\) and \(S\) by creating a hash table of size \(B-2\) (the remaining 2 is taken up by storing \(R\) and \(S\)).
Cost
\[[L] + [R]\]Buffer Management $\(B\)$
Grace Hash Join#
Using external hashing, for every pass we can take the partition and do Build & Probe which is a pass that uses naive hash join on the partition
For each partiton:
Load pages of \(R\) into the input buffer
Store input buffer to hash table
Repeat steps 1-2 until no more \(R\) pages
Store pages of \(S\) into the input buffer
Search the hash table for match. Write matches to output
Continue until no more \(S\). Remember to flush out the output if full.
Cost: $\( 3([L] + [R])\)$
Buffer (2-Pass) $\( [L] = (B-1)(B-2) \)$