Title
|
Optimizer
The goal of the optimizer is to find a good plan to execute a query,
if a plan exists. The ``if a plan exists'' part is important because
the ObjectGlobe optimizer, other than a traditional optimizer, might
sometimes fail to find a plan, even if the parser was able to discover
relevant resources. First of all, limitations due to authorizations
can make it impossible to find a valid plan; for instance, it might
happen that two collections cannot be joined because there is no cycle
provider that has permission to see both collections. Furthermore,
ObjectGlobe users and applications can limit the total cost (in $)
and response time of a query; this is an idea adopted from
Mariposa [SAL+96].
The optimizer enumerates alternative plans using a System-R
style dynamic programming algorithm. That is, the optimizer builds
plans in a bottom-up way: first so-called access plans are
constructed that specify how each collection is read (i.e., at which
cycle provider and with which scan or wrapper operator). After that, join plans are constructed from
these access plans and (later) from simpler join plans.
To deal with unary external functions and predicates, the dynamic
programming algorithm is extended as described
in [CS96]. In every step, the cost of each plan is
estimated and inferior plans are pruned in order to speed up the
optimization process. Rather than presenting the full details of the
ObjectGlobe optimizer, we would like to highlight the peculiarities
that make the ObjectGlobe optimizer special:
Compatibility Matrix
During query optimization every plan
is annotated (among others) with a compatibility matrix. The
compatibility matrix of an access plan is identical with the
compatibility matrix generated by the parser for the corresponding
collection. The matrix of a join plan which is composed of two
sub-plans is generated by ANDing the two compatibility matrices of the
two sub-plans, resulting in a more restrictive matrix.
Sanity Checks
Some sub-plans can be immediately discarded
during plan enumeration based on the sub-plan's compatibility matrix.
As an example, consider the following situation: collections R1and R2 belong to the same theme
and a query is
interested in
for some external function f. For
collection R1, f may only be executed by cycle provider x; for
collection R2, f may only be executed by cycle provider y. Now
a sub-plan
can immediately be discarded because there is
no way to execute f on top of
(neither x nor ywork); in other words, the
plan has no points set in
the f row of its compatibility matrix. (Note, however, that an
plan is valid, if it is equivalent.) If several
variants of f exist, then the
plan can be discarded
if there is no point set in the shelf of f rows. (A shelf
is a set of rows in the matrix for different variants of the same function.)
Obviously, a plan can also be immediately discarded if its estimated
cost or response time exceeds the specified limit.
We also carry out more sophisticated sanity checks at the beginning of
query optimization. For example, there must be at least one cycle
provider which has permission and is capable to execute the display operator for each collection. Typically, this must be
the client machine at which the query was issued. If such a cycle
provider does not exist, then no plan exists and the optimizer can
stop without enumerating any plans. In theory, such sanity checks
that span several compatibility matrices could be applied in order to
discard certain sub-plans during the plan enumeration process; since
these sanity checks are quite costly, however, they are only carried
out once, at the beginning before plan enumeration starts.
UNION Queries
As shown earlier, collections can be
horizontal partitions which need to be unioned and different
collections of the same theme can have different authorization
requirements (i.e., different compatibility matrices). As a result,
the optimizer must consider each collection individually, even
collections of the same theme which are not treated individually by
traditional optimizers. Considering every collection individually
involves extending the dynamic programming algorithm for plan
enumeration; essentially, the optimizer enumerates
in
the same way as a two-way join plan and
in the
same way as a three-way join plan, if
R1, R2, R3 belong to the
same theme. The ObjectGlobe optimizer would also consider plans like
as well as
for queries that involve these three collections.
Bypassing Data
Currently, the optimizer models
authorization information in a coarse grained way: either the whole
collection or nothing may be processed by a cycle provider or some
function. In many application environments, however, a finer-grained
authorization model is desirable; e.g., a cycle provider might be
allowed to see certain attributes of a collection and others not.
Such a finer-grained authorization model makes more plans possible.
One option is to encrypt those attributes that a cycle provider is not
allowed to see. Another option is to split the objects of a
collection and only ship those attributes to the cycle
provider which the cycle provider is allowed to see; the other
attributes are bypassed and later the objects are reassembled at a
cycle provider that has permission to see all attributes. In some
sense, such a plan works like a semi-join program. In order to
enumerate such plans, we have once more extended the
plan enumeration algorithm; details of this extension are described
in [BKKS99]. (As shown
in [BKKS99], such plans are also useful to get
better response times even if there are no authorization constraints.)
Costing
To estimate the cost (in $) and response time of
a plan the optimizer completely relies on the statistics and cost
functions registered in the lookup service. In the absence of such
statistics, the ObjectGlobe optimizer will guess, just as any
other optimizer. More work on costing in distributed and
heterogeneous query processors has been reported
in [ROH99]. Costing in the ObjectGlobe optimizer is
currently very simple, but we are planning to extend our costing
framework along the lines proposed in [ROH99].
IDP
Evidently, the search space can become too large for
full dynamic programming to work for complex ObjectGlobe queries. To
deal with such queries, we developed another extension that we call
iterative dynamic programming (IDP for short). IDP is adaptive; it starts like dynamic programming and if the query
is simple enough, then IDP behaves exactly like dynamic
programming. If the query turns out to be too complex, then IDP applies heuristics in order to find an acceptable plan.
Details and a complete analysis of IDP is given
in [KS98b].
Lehrstuhl für Datenbanksysteme
Letzte Änderung:
25.05.2005 um 14:34:37