Security is a critical issue in open, extensible, and distributed systems. Incorrect code can intentionally or unintentionally falsify data, provide access to secret data, or impede the work of the system. Security can be improved by access control, scrutiny of the code before execution, and supervision of the program during execution.
OperatorCheck inspects external operators of ObjectGlobe. Their correspondence with an abstract formal specification is checked using test cases that are generated out of directives given by the tester. Additionally, the resource consumption of the operator during execution is measured and compared to a cost model also provided by the programmer.
As the correctness of a program depends on what exactly it is supposed to do, a complete and consistent specification is necessary. If testing should be processed automatically, a formal specification is required so that an interpreter can determine whether a calculated result is correct.
Several methods of formal specifications have been investigated. The best one seems to be using a functional language, because functional languages are Turing-complete and thus every operator can be specified that way, and because there are interpreters that can execute the specification and thus the correct result can be calculated. Haskell is a purely functional language. The Haskell interpreter's task is to reduce an expression to normal form. It tries to match the left side of an equation with a subexpression and replaces it with the right side.
Especially in database context, not only the correctness of the result is important but also the efficiency of its computation. Therefore, the specification of an external operator is augmented with several cost models for result size and resource consumption. Worst case cost models are used for detecting denial of service attacks by erroneous operators.
The tester provides a Java archive containing the external operator to be tested, a formal specification of the operator in Haskell or a cost model, depending on the kind of test to be performed, and some directives on how the test data should look like.
For a correctness test, the server generates test data based on the directives and stores them in appropriate formats on the hard disk. A Haskell program is built out of the formal specification, and the generic ObjectGlobe query plan is assembled. Now the test is performed: The Haskell interpreter and the ObjectGlobe system calculate their results. Afterwards, the results are loaded and compared, and the user receives the result of the test.
Moreover, it is possible to perform a reference test. Instead of providing a Haskell specification, the user can also nominate another ObjectGlobe operator that should serve as the oracle. The testing process works in an analogous way.
In a benchmark test, no oracle is consulted, but the test operator is executed several times using different sizes on input data. Instead of a formal specification, the user provides cost models for several resources. The consumption of these resources is measured and compared to the cost models. The test result shows the actually consumed resources and the maximum allowed by the cost models.
The starting page of the application displays a form where the user is asked for some basic information about the external operator.
| Directive | Effect |
|---|---|
| Cardinality(Rel,N) | sets the number of tuples in the relation |
| Cyclic(Rel,Attr,...) | generates values in increasing cyclic order |
| Unique(Rel,Attr,...) | generates unique random values |
| Uniform(Rel,Attr,...) | generates uniformly distributed random values |
| Normal(Rel,Attr,...) | generates normally distributed random values |
| ReferencesCyclic(Rel,Attr,Rel',Attr') | references in increasing cyclic order |
| ReferencesUnique(Rel,Attr,Rel',Attr') | references uniquely-randomly |
| ReferencesUniform(Rel,Attr,Rel',Attr') | references uniformly-randomly |
| Sort(Rel,Attr,asc/desc) | sorts a relation according to an attribute |
| Shuffle(Rel,Factor) | randomly shuffles a relation |
| AttrType | Parameters | Admissible Values |
|---|---|---|
| BOOLEAN | Rel,Attr | {true,false} |
| INT | Rel,Attr,Min,Max | {min,min+1,...,max} |
| INT | Rel,Attr,{x1,...,xn} | {x1,...,xn} |
| DOUBLE | Rel,Attr,Min,Max,Step | {min,min+step,...,max} |
| DOUBLE | Rel,Attr,{x1,...,xn} | {x1,...,xn} |
| STRING | Rel,Attr,Min,Max | {a1...an | ai<-{A,...,Z}, n<-{min,...,max}} |
| STRING | Rel,Attr,{x1,...,xn} | {x1,...,xn} |
For a correctness test, the application automatically generates the header and the footer of a Haskell program that defines the types of the relations, reads the test data from files, and calls the function simulating the operator. This function has to be filled in by the tester in the next form. Moreover, the tester can select the comparison mode (list, where order and count of the tuples matter; multiset, where order does not matter; set, where neither order nor count matters) and whether test data and results should be displayed in detail.
The form for the reference test is quite similar. Here, the user is asked
for the classname and the URL of the operator that should be used as a
reference implementation. Built-in operators can also serve for that task.
Then the field for the function provider has to be left empty. In the next
field, parameters for the reference operator can be given. Again the syntax
is
ParameterName = "ParameterValue".
At last, the comparison mode and the display detail can be selected.
In a benchmark test, not the correctness of an operator is checked but
its consumption of resources is measured. Furthermore, the behaviour, i.e.,
the stability and the performance degeneration, of the operator under heavy
load can be examined. Supervised resources are the number of tuples
produced, the maximum size of a produced tuple, primary and secondary
memory, and CPU usage. For each resource the user can specify a
worst-case cost model. If any resource is overconsumed, the operator is
considered faulty and aborted immediately in a real application.
Nevertheless, the cost models should not be chosen too generous, because
then a cycle provider might refuse to start it at all.
A cost model is a function of the extents of the n input relations:
nrTuples[k], maxTupleSize[k], and
totalDataSize[k], k<-{0,...,n-1}. Integers are allowed as
constants. Available operators are +, -, *,
/, and ^.
In the last field, the test runs can be designed. For each run, the sizes of
the input relations are specified in one line, separated by commas. The
number of lines determines the number of test runs.
For demonstration, three typical external operators have been implemented in ObjectGlobe and forms are already filled in with their formal specification. But of course, you can also start with an empty form.
The Skyline operator calculates the maxima in a partially ordered set of vectors, thus solving the so-called "maximum vector problem". A vector dominates another vector, if it is greater in all dimensions. A vector is a maximum of a given set, if it is not dominates by any other vector. All maxima are to be found. The operation is called "Skyline" because of the representation in a two-dimensional scenario.
DiagJoin is an example for a specialized join algorithm that can be used for joins over 1:N relationships in data warehouse applications. Related tuples are created nearly at the same time, hence they are located in neighbouring positions in their files. Therefore, the algorithm scans both relations simultaneously and looks for join partners in the current window.
Fagin's Ranking algorithm is especially relevant to distributed systems. It selects the top k objects rated by m data sources. Each data source assigns a value in [0,1] to each object, which symbolizes the relevance of this object to fuzzy queries between "not at all" and "completely". The overall rating of an object is the minimum of all ratings.
S. Börzsönyi: Quality Assurance for External Database Operators. Diploma thesis, University of Passau, 2001. (PDF)
S. Seltzsam, S. Börzsönyi, A. Kemper: Security in a Distributed and Extensible Query Engine. In preparation, 2001.
S. Börzsönyi, K. Stocker, D. Kossmann: The Skyline Operator. In Proceedings of the IEEE Conference on Data Engineering, Heidelberg, Germany, April 2001.
S. Helmer, T. Westmann, G. Moerkotte: DiagJoin: An Opportunistic Join Algorithm for 1:N Relationships. In Proceedings of the 24th VLDB Conference, New York, United States, August 1998.
R. Fagin: Combining Fuzzy Information from Multiple Systems. In Proceedings of the PODS '96 Conference, Montreal, Canada, 1996.