TU München - Fakultät für Informatik
Lehrstuhl III: Datenbanksysteme
Technische Universität München

The AutoGlobe Project - Experimental Results

AutoGlobe addresses two issues concerning the well-balanced organization of IT infrastructures: the supervision of service execution and the computation of allocation designs. The benchmark results for the former are shown below.

Performance Evaluation

Description of the benchmark environment. We performed comprehensive benchmarks that model a realistic ERP installation. The tests demonstrate that the fuzzy controller can react automatically to changing resource requirements as the number of users to be served increases.

Architecture of the benchmark environment
Figure 1. Benchmark System-Architecture

Figure 1 illustrates the architecture of our simulated ERP installation, which is, like an SAP-based ERP system, divided into a database layer, an application server layer, and a presentation layer. End users communicate with the ERP installation using clients in the presentation layer. The end users' clients themselves do not affect the system, so we only simulate the number of users connected to the services of our simulated ERP installation. The simulated services and servers are described using our declarative XML language, just like real services and servers. The installation comprises three subsystems in the application and database layer: Classical Enterprise Resource Planning (ERP), Business Warehouse (BW), and Customer Relationship Management (CRM), each supplied with its own dedicated database and central instance (CI). The central instance applications are responsible for the global lock management of their particular subsystem. The other application servers (BW, CRM, FI, HR, LES, PP) execute the application logic, i.e., process user requests. Our controller supervises these application servers, databases, and central instances.

In a real system, there is a great deal of communication between the individual services. In our benchmark environment, we neglect communication costs because we assume a local high-bandwidth network connection. This is realistic in blade server environments which are normally equipped with Gigabit Ethernet or Infiniband.

Our system simulates a varying number of users generating requests. As observed in running SAP installations, the course of a request is simulated as follows. First, a request increases the load of the affected service host for a short period. Before handling the request in the database, the lock management of the central instance (CI) is requested. Therefore, the load drops on the application server and increases on the central instance. In case of a positive check the request is passed to the database. Thus, load drops on the central instance and increases on the database for the processing time. Finally, the database sends the answer back to the application server. Thus, for a short period, the load drops on the database and rises on the application server. Since the load caused by a single request depends on the specific service, e.g., an FI request produces lower load than a BW request, our benchmark system uses service-specific parameters to simulate the impact of requests.
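The course of a request described above can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual code: the class name `ServiceProfile`, the function `request_trace`, and all numeric load values are assumptions chosen to show the order in which load moves between tiers. Only the successful lock check is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceProfile:
    """Hypothetical service-specific parameters: load added to a tier
    while a request is being handled there (arbitrary units)."""
    app_load: float
    ci_load: float
    db_load: float


# Order in which a single request puts load on the tiers:
# application server -> central instance (lock check) -> database
# -> application server (answer is returned).
PHASES = ("app_server", "central_instance", "database", "app_server")


def request_trace(profile):
    """Return one (tier, load) pair per phase of a single request."""
    per_tier = {
        "app_server": profile.app_load,
        "central_instance": profile.ci_load,
        "database": profile.db_load,
    }
    return [(tier, per_tier[tier]) for tier in PHASES]


# Illustrative values only: an FI request is lighter than a BW request.
FI = ServiceProfile(app_load=1.0, ci_load=0.5, db_load=2.0)
BW = ServiceProfile(app_load=3.0, ci_load=0.5, db_load=8.0)

for tier, load in request_trace(FI):
    print(f"{tier}: +{load}")
```

At any point in time only one tier carries the load of a given request, which mirrors the "load drops here and rises there" behavior of the simulation.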

Load curves of LES and BW
Figure 2. Qualitative Load Curve of LES and BW

In addition to the load produced by user requests, every application server itself induces a basic load. The load curves generated by the simulated services follow predetermined patterns that can be observed in many companies running SAP software. Exemplary load curves for an LES and a BW application are shown in Figure 2, illustrating that the BW works mainly during the night, while the LES has its peaks in the morning, before lunch, and in the evening.

Hardware and initial deployment
Figure 3. Simulated Hardware and Initial Deployment

We assume a hardware environment that is scaled for peak load, as is common in today's computing centers. A standard single-processor blade in our benchmark (performance index = 1) is dimensioned to handle at most 150 users of one service. The CPU load of the blades is between 60% and 80% during main activity in order to retain reserves for unpredictable load bursts. Figure 3 shows the simulated hardware and the initial deployment of the services. The simulated servers are:

Service   Number of Users   Number of Instances
BW        60                2
CRM       300               1
FI        600               3
HR        300               1
LES       900               4
PP        450               2
Table 1. Initial Number of Users

The performance index values stated are based on estimations and do not necessarily reflect the true performance of the servers. Table 1 shows the number of users per service and the number of instances that are started initially. These numbers are reasonable for a medium-sized company running an SAP system, e.g., most departments use the LES application servers while only the staff department uses the HR application servers.

Every benchmark starts with the same reasonable initial deployment of the services shown in Figure 3. We run different benchmark series and continually increase the number of users by 5% until the system becomes overloaded. The BW is an exception because it processes batch jobs instead of interactive requests. Thus, we increase the load per batch job by 5% and leave the number of jobs constant.
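The stepping of the benchmark series can be sketched as follows. The user numbers come from Table 1. Everything else is an assumption: the function name `scale` is hypothetical, and each 5% step is taken relative to the initial numbers (not compounded), which matches how the results are later reported relative to Table 1. For BW, the number of jobs stays constant and only a load factor per job grows.

```python
# Initial numbers per service, taken from Table 1.
INITIAL_USERS = {"BW": 60, "CRM": 300, "FI": 600, "HR": 300, "LES": 900, "PP": 450}

STEP_PERCENT = 5  # increase per benchmark series


def scale(step):
    """Return (users per service, BW load factor per batch job) after
    `step` increases. Assumption: steps are relative to the initial
    numbers, i.e. step 3 means 115% of Table 1."""
    percent = 100 + STEP_PERCENT * step
    users = {
        # BW processes batch jobs: its count is left unchanged.
        service: (count if service == "BW" else count * percent / 100)
        for service, count in INITIAL_USERS.items()
    }
    bw_load_factor = percent / 100
    return users, bw_load_factor


users, bw_factor = scale(3)
print(users["FI"], users["BW"], bw_factor)
```

Fractional user counts (for example for PP at 115%) are left unrounded here; how the benchmark rounds them is not stated.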

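The benchmark includes three different scenarios, each assuming a different flexibility of services:

Static: services are static, so the controller cannot remedy overload situations.
Constrained mobility: the controller can start additional instances of a service (scale-out) and stop instances (scale-in), but users are not redistributed dynamically.
Full mobility: in addition, service instances can be moved to other servers and users are dynamically redistributed across all instances of a service.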

Today, the movement of services as well as the dynamic redistribution of users are only supported by few services because services must explicitly assist the movement or redistribution. Movement requires that the service be able to store its internal state before it is stopped, and that the newly started instance can restore the old state. Furthermore, it must be guaranteed that the users can be reconnected automatically to the newly instantiated service instance. Dynamic redistribution requires that the service be able to move parts of its state to another instance. In the future, we expect that more services will support dynamic relocation and redistribution and thus consider them in the full mobility scenario.

To prevent the system from reacting too late, we set the controller's threshold value for a CPU overload to 70%, i.e., a server with more than 70% CPU load is considered overloaded. In this case, the controller monitors the load values of the service for 10 minutes (watchTime) in order to prevent the system from overreacting to short load bursts. After an action has been executed, the affected services are protected for 30 minutes and the affected servers for 60 minutes. The threshold value for an idle situation depends on the performance index of the server and is 12.5% / performance_index. An idle situation is recognized after a watchTime of 20 minutes.
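The threshold and watchTime settings can be illustrated with a small classifier. This is a simplified sketch and not the fuzzy controller itself: the crisp rule "every sample in the watch window is beyond the threshold", the one-sample-per-minute granularity, and the names `classify` and `idle_threshold` are all assumptions made for illustration. The numeric settings are the ones stated above.

```python
OVERLOAD_THRESHOLD = 70.0   # percent CPU load
OVERLOAD_WATCH_MIN = 10     # watchTime for overload, in minutes
IDLE_BASE = 12.5            # percent, divided by the performance index
IDLE_WATCH_MIN = 20         # watchTime for idle, in minutes
SERVICE_PROTECT_MIN = 30    # protection time of affected services
SERVER_PROTECT_MIN = 60     # protection time of affected servers


def idle_threshold(performance_index):
    """Idle threshold in percent: 12.5% / performance_index."""
    return IDLE_BASE / performance_index


def classify(samples, performance_index):
    """Classify a server from per-minute CPU load samples (most recent
    last). Assumption: a situation is confirmed only if every sample in
    the watch window is beyond the threshold."""
    overload_window = samples[-OVERLOAD_WATCH_MIN:]
    if len(samples) >= OVERLOAD_WATCH_MIN and all(
        s > OVERLOAD_THRESHOLD for s in overload_window
    ):
        return "overloaded"
    idle_window = samples[-IDLE_WATCH_MIN:]
    limit = idle_threshold(performance_index)
    if len(samples) >= IDLE_WATCH_MIN and all(s < limit for s in idle_window):
        return "idle"
    return "normal"


print(classify([75.0] * 10, 1))          # sustained load above 70%
print(classify([75.0] * 9 + [60.0], 1))  # burst ended within the watchTime
print(classify([5.0] * 20, 1))           # below 12.5% for 20 minutes
```

A more powerful server has a lower idle threshold: with a performance index of 2, a load of 10% no longer counts as idle because the limit drops to 6.25%.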

All benchmark runs are carried out at 40-fold acceleration and simulate a system for 80 hours. The time intervals shown correspond to simulated time.
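As a quick check of what the acceleration means in practice, the following sketch converts simulated time to wall-clock time. It assumes that 40-fold acceleration means simulated time passes 40 times faster than real time; the function name `wall_clock_seconds` is hypothetical.

```python
ACCELERATION = 40      # simulated time runs 40 times faster (assumed meaning)
SIMULATED_HOURS = 80   # length of one benchmark run in simulated time


def wall_clock_seconds(simulated_minutes):
    """Real seconds needed to simulate the given number of minutes."""
    return simulated_minutes * 60 / ACCELERATION


run_hours = wall_clock_seconds(SIMULATED_HOURS * 60) / 3600
print(run_hours)               # 2.0 hours of real time per run
print(wall_clock_seconds(10))  # 15.0 seconds for a 10-minute watchTime
```

Under this assumption, one 80-hour run takes two hours of real time.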

Benchmark results. Every benchmark starts with the same reasonable initial deployment of the services. Figures 4, 5, and 6 show benchmark results with the number of users increased by 15% compared to the user numbers shown in Table 1. This demonstrates how the ERP installation handles an increasing number of users. The figures show the load curves of all servers and the average load of the whole system, indicated by a thick line.

CPU load of all servers in the static scenario
Figure 4. CPU Load of all Servers (Static Scenario)

In the static scenario, several servers become overloaded at regular intervals, i.e., they have a CPU load of more than 80% for a long time. (The controller already treats a server as overloaded at more than 70% CPU load so that it does not react too late; for the evaluation, however, we consider a server overloaded only if it has more than 80% CPU load for a long time.) A non-adaptive computing environment therefore cannot handle this situation satisfactorily. If a host running an interactive service is overloaded, the service requires more time to process requests and therefore delays new requests. Thus, users cannot perform all their requests in a given period, e.g., a working day, and requests are delayed until the next day. If a BW application server is overloaded, the batch jobs require more time. Thus, they may come into conflict with other services and compete with them for resources.

CPU load of all servers in the constrained mobility scenario
Figure 5. CPU Load of all Servers (Constrained Mobility Scenario)

The situation already improves in the constrained mobility scenario. The controller reacts to emerging overload situations by automatically starting additional instances of services. Because the users are not dynamically redistributed after a scale-out has taken place, the original servers remain heavily loaded for a while. Due to user fluctuations, the load of the initially overloaded services decreases slowly. Altogether, the overload situations are on average shorter than in the static scenario, but due to the restrictions of the static user distribution, they cannot be prevented completely.

CPU load of all servers with full mobility
Figure 6. CPU Load of all Servers (Full Mobility Scenario)

In the full mobility scenario, the results are much better than in the constrained mobility scenario. Idle resources are efficiently used to relieve the load on heavily used resources. Thus, the utilization of the hardware is well-balanced. Due to the dynamic redistribution of users across all service instances, the effects of controller actions are observable almost instantly. Another advantage of the full mobility scenario is that the controller can react more flexibly to overload situations. The remaining short overload peaks at the beginning stem from the watchTime. If the instances of a service become overloaded, the controller monitors the instances for 10 minutes before starting a new instance. Therefore, the existing instances stay overloaded for a short time. After the first day, there are normally more instances of every application server running than in the beginning. Thus, if the controller does not stop too many instances, the load can be distributed across a sufficient number of instances, and overload situations can be avoided.

To demonstrate the behavior of our controller in more detail, we present the load curves of the FI application servers in the benchmarks described above.

CPU load of all FI instances in the static scenario
Figure 7. CPU Load of the FI Instances (Static Scenario)

Figure 7 shows the load curve of the FI application servers in the static scenario. There are three instances running on Belinda, Brewers, and Redsox. As services are static, the controller cannot remedy the overload situations. Thus, the service instances running on the less powerful blades become overloaded periodically. If a service or a server is overloaded, it can no longer be used in a reasonable way because the processing of mission-critical OLTP requests is slowed down.

CPU load of the FI instances in the constrained mobility scenario
Figure 8. CPU Load of the FI Instances (Constrained Mobility Scenario)

Figure 8 shows load curves in the constrained mobility scenario. When the employees begin to work, the instances on Belinda and Brewers become overloaded. The controller's reaction is to start an additional instance on Dagobert ("Out Dagobert"). Since users are not redistributed dynamically, the load of Belinda and Brewers only decreases slowly. These two hosts are still overloaded after the protection time, thus the controller starts another instance on Leia ("Out Leia"). Because these actions do not remedy the overload on Brewers fast enough, the controller decides to stop the instance running on Brewers ("In Brewers") to protect the host from a continuous overload situation. This FI instance is started again ("Out Brewers") after a short period of time due to an overload situation on Dagobert. Another FI instance is started on Braves ("Out Braves"). Further on, the controller starts new FI instances as required and stops instances running on overloaded blades and idle instances. During the second day, the controller needs only to execute one scale-in action because the FI instances running on Belinda, Brewers, and Leia can handle the load. The FI instance on Redsox is stopped ("In Redsox") because Redsox is additionally running a CRM instance and, thus, is overloaded. The FI instance running on Leia is stopped ("In Leia") in the night because the database of the BW subsystem uses the resources of Leia heavily. Thus, at the beginning of the third day, the remaining FI instances become overloaded. To remedy this overload situation, the controller starts new FI instances as required. In summary, the controller can avert most imminent overload situations from the FI. The remaining overload situation periods are short.

CPU load of the FI instances with full mobility
Figure 9. CPU Load of the FI Instances (Full Mobility Scenario)

Figure 9 shows load curves in the full mobility scenario. Again, the controller adds and stops instances as required. Additionally, service instances are moved from heavily loaded servers to other servers. In this scenario, users are dynamically redistributed, so the effects of controller actions are observable instantly and overload situations can be averted completely.

Summary of benchmarks. We ran benchmark series for the three scenarios and each time increased the number of users by 5% until the system became overloaded, i.e., one or more servers had a CPU load of more than 80% for a long time. Table 4 shows the maximum number of users that can be handled by the existing hardware in the different scenarios. The values are relative to the number of users stated in Table 1.

Scenario               Number of Users
static                 100%
constrained mobility   115%
full mobility          135%
Table 4. Maximum Possible, Relative Number of Users

In the static scenario, the hardware is sized for the initial number of users. Thus, if we increase the number of users by 5%, some servers immediately become overloaded. Using our controller in the constrained mobility scenario, the ERP installation can handle 15% more users because otherwise idle resources are used to remedy overload situations. Due to the restrictions of the static user distribution and of the available actions, idle resources cannot be used as efficiently as in the full mobility scenario. Nevertheless, our controller already works quite well for the constrained mobility scenario. In the full mobility scenario, our controller can push the number of users that can be handled by the ERP installation to 135% compared to the static scenario. The number of users is higher than in the constrained mobility scenario because idle resources can be used more efficiently.
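To make the relative values concrete, the following sketch converts the percentages of Table 4 into absolute user numbers per interactive service, using the initial numbers of Table 1. The function name `max_users` is hypothetical. BW is left out because its load is scaled per batch job rather than per user, and fractional results are not rounded.

```python
# Initial users of the interactive services, taken from Table 1
# (BW is omitted because it processes batch jobs).
INITIAL_USERS = {"CRM": 300, "FI": 600, "HR": 300, "LES": 900, "PP": 450}

# Maximum relative number of users per scenario, taken from Table 4.
MAX_RELATIVE_PERCENT = {
    "static": 100,
    "constrained mobility": 115,
    "full mobility": 135,
}


def max_users(scenario):
    """Absolute maximum users per service for the given scenario."""
    percent = MAX_RELATIVE_PERCENT[scenario]
    return {service: count * percent / 100 for service, count in INITIAL_USERS.items()}


for scenario in MAX_RELATIVE_PERCENT:
    print(scenario, max_users(scenario))
```

For FI, for example, this gives 600 users in the static scenario, 690 with constrained mobility, and 810 with full mobility on the same hardware.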

The conclusion of our studies is that our controller can improve the capability of current IT infrastructures if static services like databases and central instances are deployed well. Additional degrees of freedom and dynamic user redistribution result in much more effective controller actions and, thus, a higher number of users that can be handled by the available hardware.