|
TU München - Fakultät für
Informatik
Lehrstuhl III: Datenbanksysteme
|
|
The Window Algorithm
Introduction
The Window Algorithm was developed to accommodate the demand for a method to
calculate the exact result of the BestmatchJoin on streaming data. Thus, the
two main characteristics of the Window Algorithm have already been mentioned:
- It works on streaming data.
- It calculates the exact result of the BestmatchJoin.
The second characteristic distincts the Window Algorithm from approximative
algorithms for calculating the BestmatchJoin. Whereas approximative methods
in general do not generate all correct result tuples and can furthermore
generate "false" tuples, i.e. tuples that are not correct result tuples,
the Window Algorithm generates all correct result tuples of the BestmatchJoin
and only those.
How can this be achieved? Of course, to be able to calculate the exact result
of the BestmatchJoin on streaming data, certain preconditions and constraints
must hold. First of all, the streaming input data must be sorted in ascending
order in one dimension (i.e. in one attribute). Secondly, only those pairs of
tuples are accepted in the join result whose join attribute values have a
distance of no more than a given epsilon value from each other. This is why the
calculated BestmatchJoin is called "constrained BestmatchJoin". The
epsilon values can be different for each join dimension (i.e. pair of
corresponding join attributes from the two input streams).
We shall limit our considerations to the constrained left-outer
BestmatchJoin (LOBMJE) here. Other variants of the join work similarly.
Description of the Algorithm
Simply put, the Window Algorithm works like this: At the beginning, two data
windows are initialized, one for each input stream. Both data windows have
identical upper- and lowerbounds. Initially, they are set to zero and to the
epsilon value of the sorted dimension respectively. Then the windows are
filled with all input tuples that fit into them, that means, whose sorted
attribute values lie within the window boundaries. Then the contents of the
two data windows are joined using any appropriate join implementation and
partial order for determining if one pair of tuples dominates another.
Hereby generated pairs of potential result tuples are written to a temporary
buffer. This must be done because at a later stage, better pairs might be
generated and the before generated pair must then be removed from the buffer.
Hence, potential result tuples cannot be written to the join output immediately
but must be buffered.
After that, the data windows are updated like this: All tuples that lie on
the lowerbound of the left data window are removed from the window. Then, the
window boundaries of both data windows are updated following one of four
possible cases:
- If the left data window is empty and the left input stream has no more
tuples, the remaining tuples in the temporary buffer are written to the
output and the join computation ends.
- If there are tuples remaining in the left data window and also in the left
input stream, the lowerbound of the data windows is set to the minimum of
the smallest tuple in the window (in terms of the value of the sorted
attribute) and the value of the sorted attribute of the next tuple from
the input stream minus the epsilon value of the sorted dimension. The
upperbound is set to the lowerbound plus the epsilon value of the sorted
dimension.
- If there are tuples remaining in the left data window but not in the left
input stream, the lowerbound is set to the smallest value of the sorted
attribute of the tuples in the window. The upperbound is again set to the
lowerbound plus the according epsilon value.
- If there are no tuples remaining in the left data window but there still
are tuples in the left input stream, then the upperbound is set to the
value of the sorted attribute of the next tuple from the input stream. The
lowerbound is set to the upperbound minus the according epsilon
value.
Now, all tuples in the right data window that lie below the new lowerbound
are removed. Also, all pairs of potential result tuples in the temporary buffer
whose left tuples lie below the new lowerbound (i.e. whose values of the sorted
attribute are smaller than the new lowerbound) are removed from the buffer and
written to the output as they cannot be dominated by a better pair anymore (no
more pair will be generated in the according epsilon environment).
Then the data windows are filled again with all tuples from the input streams
that fit into the windows with the new boundaries. Notice that this works fine
and the data windows always move up along the sorted dimension because the
tuples in the input streams are sorted in ascending order in that dimension.
In the last of the above cases, the contents of the two data windows are then
again joined at the beginning. In all other cases however, only parts of the
data windows are joined to avoid the generation of duplicates. Here, only the
"old" part of the left data window (the tuples in the window that already were
in the window prior to the last update) is joined with the "new" part of the
right data window (the tuples that were written into the window during the
last update step). Furthermore, the "new" part of the left data window is
joined with the whole right data window ("old" and "new" part together). This is
shown in the pictures below.
All this loops until the first of the above four cases occurs and the join
computation ends having generated all correct result tuples of the
constrained left-outer BestmatchJoin (and only those).
Figure 1: Join computation for the constrained left-outer BestmatchJoin
People
Lehrstuhl für Datenbanksysteme
Letzte Änderung:
25.05.2005 um 14:37:47