Fakultät für Informatik TU München - Fakultät für Informatik
Lehrstuhl III: Datenbanksysteme
Technische Universität München
Home  |  Personen  |  Forschung  |  Lehre  |  Sonstiges  | 



Optimizer

The goal of the optimizer is to find a good plan to execute a query, if a plan exists. The ``if a plan exists'' part is important because the ObjectGlobe optimizer, other than a traditional optimizer, might sometimes fail to find a plan, even if the parser was able to discover relevant resources. First of all, limitations due to authorizations can make it impossible to find a valid plan; for instance, it might happen that two collections cannot be joined because there is no cycle provider that has permission to see both collections. Furthermore, ObjectGlobe users and applications can limit the total cost (in $) and response time of a query; this is an idea adopted from Mariposa [SAL+96].

The optimizer enumerates alternative plans using a System-R style dynamic programming algorithm. That is, the optimizer builds plans in a bottom-up way: first so-called access plans are constructed that specify how each collection is read (i.e., at which cycle provider and with which scan or wrapper operator). After that, join plans are constructed from these access plans and (later) from simpler join plans. To deal with unary external functions and predicates, the dynamic programming algorithm is extended as described in [CS96]. In every step, the cost of each plan is estimated and inferior plans are pruned in order to speed up the optimization process. Rather than presenting the full details of the ObjectGlobe optimizer, we would like to highlight the peculiarities that make the ObjectGlobe optimizer special:

Compatibility Matrix

During query optimization every plan is annotated (among others) with a compatibility matrix. The compatibility matrix of an access plan is identical with the compatibility matrix generated by the parser for the corresponding collection. The matrix of a join plan which is composed of two sub-plans is generated by ANDing the two compatibility matrices of the two sub-plans, resulting in a more restrictive matrix.

Sanity Checks

Some sub-plans can be immediately discarded during plan enumeration based on the sub-plan's compatibility matrix. As an example, consider the following situation: collections R1and R2 belong to the same theme ${\cal R}$ and a query is interested in $f({\cal R})$ for some external function f. For collection R1, f may only be executed by cycle provider x; for collection R2, f may only be executed by cycle provider y. Now a sub-plan $R_1 \cup R_2$ can immediately be discarded because there is no way to execute f on top of $R_1 \cup R_2$ (neither x nor ywork); in other words, the $R_1 \cup R_2$ plan has no points set in the f row of its compatibility matrix. (Note, however, that an $f(R_1) \cup f(R_2)$ plan is valid, if it is equivalent.) If several variants of f exist, then the $R_1 \cup R_2$ plan can be discarded if there is no point set in the shelf of f rows. (A shelf is a set of rows in the matrix for different variants of the same function.) Obviously, a plan can also be immediately discarded if its estimated cost or response time exceeds the specified limit.

We also carry out more sophisticated sanity checks at the beginning of query optimization. For example, there must be at least one cycle provider which has permission and is capable to execute the display operator for each collection. Typically, this must be the client machine at which the query was issued. If such a cycle provider does not exist, then no plan exists and the optimizer can stop without enumerating any plans. In theory, such sanity checks that span several compatibility matrices could be applied in order to discard certain sub-plans during the plan enumeration process; since these sanity checks are quite costly, however, they are only carried out once, at the beginning before plan enumeration starts.

UNION Queries

As shown earlier, collections can be horizontal partitions which need to be unioned and different collections of the same theme can have different authorization requirements (i.e., different compatibility matrices). As a result, the optimizer must consider each collection individually, even collections of the same theme which are not treated individually by traditional optimizers. Considering every collection individually involves extending the dynamic programming algorithm for plan enumeration; essentially, the optimizer enumerates $R_1 \cup R_2$ in the same way as a two-way join plan and $R_1 \cup R_2 \cup R_3$ in the same way as a three-way join plan, if R1, R2, R3 belong to the same theme. The ObjectGlobe optimizer would also consider plans like $(R_1 \cup R_2) \Join S$ as well as $(R_1 \Join S) \cup (R_2 \Join S)$for queries that involve these three collections.

Bypassing Data

Currently, the optimizer models authorization information in a coarse grained way: either the whole collection or nothing may be processed by a cycle provider or some function. In many application environments, however, a finer-grained authorization model is desirable; e.g., a cycle provider might be allowed to see certain attributes of a collection and others not. Such a finer-grained authorization model makes more plans possible. One option is to encrypt those attributes that a cycle provider is not allowed to see. Another option is to split the objects of a collection and only ship those attributes to the cycle provider which the cycle provider is allowed to see; the other attributes are bypassed and later the objects are reassembled at a cycle provider that has permission to see all attributes. In some sense, such a plan works like a semi-join program. In order to enumerate such plans, we have once more extended the plan enumeration algorithm; details of this extension are described in [BKKS99]. (As shown in [BKKS99], such plans are also useful to get better response times even if there are no authorization constraints.)

Costing

To estimate the cost (in $) and response time of a plan the optimizer completely relies on the statistics and cost functions registered in the lookup service. In the absence of such statistics, the ObjectGlobe optimizer will guess, just as any other optimizer. More work on costing in distributed and heterogeneous query processors has been reported in [ROH99]. Costing in the ObjectGlobe optimizer is currently very simple, but we are planning to extend our costing framework along the lines proposed in [ROH99].

IDP

Evidently, the search space can become too large for full dynamic programming to work for complex ObjectGlobe queries. To deal with such queries, we developed another extension that we call iterative dynamic programming (IDP for short). IDP is adaptive; it starts like dynamic programming and if the query is simple enough, then IDP behaves exactly like dynamic programming. If the query turns out to be too complex, then IDP applies heuristics in order to find an acceptable plan. Details and a complete analysis of IDP is given in [KS98b].


Lehrstuhl für Datenbanksysteme
Letzte Änderung: 25.05.2005 um 14:34:37
Title