Telecommunications Optimization and Adaptive Heuristic Techniques, Part 13


As Internet connectivity is reaching the global community, information systems are becoming more and more distributed. Inevitably, this overnight exponential growth has also caused traffic overload at various places in the network. Until recently, it was believed that scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and processing power could be brought to where they were needed.

Text content: Telecommunications Optimization and Adaptive Heuristic Techniques, Part 13

Telecommunications Optimization: Heuristic and Adaptive Techniques. Edited by David W. Corne, Martin J. Oates, George D. Smith. Copyright © 2000 John Wiley & Sons Ltd. ISBNs: 0-471-98855-3 (Hardback); 0-470-84163X (Electronic)

13 Adaptive Demand-based Heuristics for Traffic Reduction in Distributed Information Systems

George Bilchev and Sverrir Olafsson

13.1 Introduction

As Internet connectivity is reaching the global community, information systems are becoming more and more distributed. Inevitably, this overnight exponential growth has also caused traffic overload at various places in the network. Until recently, it was believed that scaling the Internet was simply an issue of adding more resources, i.e. bandwidth and processing power could be brought to where they were needed. The Internet's exponential growth, however, exposed this impression as a myth. Information access has not been and will not be evenly distributed. As has been observed, user requests create 'hot-spots' of network load, with the same data transmitted over the same network links again and again. These hot-spots are not static, but move around, making it impossible to accurately predict the right network capacity to install. All of this justifies the need to develop new infrastructure for data dissemination on an ever-increasing scale, and to design adaptive heuristics for traffic reduction.

In this chapter, we develop a distributed file system model and use it as an experimental simulation tool to design, implement and test network adaptation algorithms. Section 13.2 describes the distributed file system model in detail and explains the implemented simulation environment. Two adaptation algorithms are developed in section 13.3.
One is based on the 'greedy' heuristic principle and the other is a genetic algorithm tailored to handle the constraints of our problem. Experiments are shown in section 13.4, and section 13.5 gives conclusions and discusses possible future research directions.

Figure 13.1 A schematic representation of the network and the distributed file system.

13.2 The Adaptation Problem of a Distributed File System

The World Wide Web is rapidly moving us towards a distributed, interconnected information environment, in which an object will be accessed from multiple locations that may be geographically distributed worldwide. For example, a database of customer information can be accessed from the location where a salesman is working for the day. In another example, an electronic document may be co-authored and edited by several users. In such distributed information environments, the replication of objects has crucial implications for system performance. The replication scheme affects the performance of the distributed system, since reading an object locally is faster and less costly than reading it from a remote server. In general, the optimal replication scheme of an object depends on the request pattern, i.e. the number of times users request the data.

Presently, the replication scheme of a distributed database is established in a static fashion when the database is designed. The replication scheme remains fixed until the designer manually intervenes to change the number of replicas or their location. If the request pattern is fixed and known a priori, then this is a reasonable solution. However, in practice the request patterns are often dynamic and difficult to predict. Therefore, we need an adaptive network that optimizes itself as the request pattern changes.
We proceed with the development of a mathematical model of a distributed information/file system. A distributed file system consists of interconnected nodes, where each node i, i = 1, ..., N, has a local disk with capacity d_i to store files (see Figures 13.1 and 13.2). There is a collection of M files, each of size s_j, j = 1, ..., M. Copies of the files can reside on any of the disks provided there is enough capacity. The communication cost c_{i,k} between nodes i and k (measured as transferred bytes per simulated second) is also given.

Figure 13.2 Users connect to each node of the distributed file system and generate requests.

In our model each node runs a file manager which is responsible for downloading files from the network (Figure 13.3). To do that, each file manager i maintains an index vector l_{i,j} containing the location from which each file j is downloaded. User applications running on the nodes generate file requests, the frequency of which can be statistically monitored in a matrix {p_{i,j}}. To account for contention and to distribute the file load across the network, we model how busy the file manager at each node k is as follows:

b_k = \frac{\sum_{i,j:\, l_{i,j} = k} p_{i,j} \cdot s_j}{\sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} \cdot s_j}

Thus, the response time of the file manager at node k can be expressed as waiting time in a buffer (Schwartz, 1987):
r_k = \begin{cases} \dfrac{1}{\tau_k - b_k} & \tau_k > b_k \\ \infty & \text{otherwise} \end{cases}

where \tau_k reflects the maximum response capacity of the individual server.

Figure 13.3 Each node runs a file manager (responsible for allocating files on the network) and a number of user applications which generate the requests.

The overall performance at node i can be measured as the time during which applications wait for files to download (i.e. the response time):

O_i = \sum_{j=1}^{M} \left( \frac{s_j}{c_{i, l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}

The first term in the sum represents the time needed for the actual transfer of the data, and the second term reflects the waiting time for that transfer to begin. The goal is to minimize the average network response time:

\min_{\{l_{i,j}\}} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i, l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}

The minimization is over the index matrix {l_{i,j}}. There are two constraints: (1) the available disk capacity on each node must not be exceeded; and (2) each file must have at least one copy somewhere on the network.
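To make the model concrete, here is a minimal evaluation sketch (not code from the chapter) for the busyness b_k, the waiting times r_k and the average network response time of a given index matrix {l_{i,j}}. The array layout and names (l, p, s, c, tau) are our own assumptions; a full optimizer would search over l subject to the two constraints, whereas this function only scores a candidate solution.

```python
import numpy as np

def busyness(l, p, s):
    """b_k: share of the total request load (frequency * file size) served by node k.
    Assumes at least one non-zero request frequency."""
    N, M = p.shape
    load = np.zeros(N)
    for i in range(N):
        for j in range(M):
            load[l[i, j]] += p[i, j] * s[j]
    return load / load.sum()

def avg_response_time(l, p, s, c, tau):
    """Average network response time for index matrix l, where l[i, j] is the
    node that file manager i downloads file j from."""
    N, M = p.shape
    b = busyness(l, p, s)
    r = np.full(N, np.inf)             # saturated servers get infinite waiting time
    ok = tau > b
    r[ok] = 1.0 / (tau[ok] - b[ok])    # r_k = 1 / (tau_k - b_k) when tau_k > b_k
    total = 0.0
    for i in range(N):
        for j in range(M):
            k = l[i, j]
            # transfer time + waiting time, weighted by request frequency
            total += (s[j] / c[i, k] + r[k]) * p[i, j]
    return total / N
```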
13.2.1 The Simulation Environment

In our distributed file system model the users generate file requests and the network responds by allocating and downloading the necessary files. The file requests can be statistically monitored and future requests predicted from observed patterns. The simulation environment captures these ideas by modeling the user file requests as random walks:

p_{i,j}(t) = p_{i,j}(t-1) + \gamma

where \gamma is drawn from a uniform distribution U(-r, r). The parameter r determines the 'randomness' of the walk; if it is close to zero then p_{i,j}(t) \approx p_{i,j}(t-1).

During the simulated interval [t, t+1] the model has information about the file requests that occurred in the previous interval [t-1, t]. Thus the dynamics of the simulation can be formally defined as:

For t = 1, 2, 3, ...
    generate new file requests: P(t) = P(t-1) + {\gamma_{i,j}}
    simulate the network: O(t) = \frac{1}{N} \sum_{i=1}^{N} O_i(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i, l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}(t)

An adaptive distributed file system would optimize its file distribution according to the user requests. Since future user requests are not known and can only be predicted, the optimization algorithm has to use an expected value of the requests derived from previous observations:

\tilde{P}(t) = Prediction(P(t-1))

Thus an adaptive distributed file system can be simulated as follows (a code sketch follows the outline):

For t = 1, 2, 3, ...
    predict file requests: \tilde{P}(t) = Prediction(P(t-1))
    optimize: L(t) = Optimize(\tilde{P}(t))
    generate new file requests: P(t) = P(t-1) + {\gamma_{i,j}}
    simulate the network: O(t) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \frac{s_j}{c_{i, l_{i,j}}} + r_{l_{i,j}} \right) \cdot p_{i,j}(t)
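A minimal sketch of this adaptive loop, reusing the avg_response_time helper sketched earlier. Here Prediction(·) is taken, as an assumption, to be the naive 'repeat the last observation' predictor; optimize is a hypothetical placeholder for one of the optimizers of section 13.3, and l0 is assumed to be a feasible starting index matrix.

```python
import numpy as np

def simulate(T, p0, l0, s, c, tau, r_walk, optimize, seed=0):
    """Run T steps of the adaptive simulation and return the response-time trace.
    p0: initial request-frequency matrix, l0: initial (feasible) index matrix,
    r_walk: randomness parameter r of the request random walk."""
    rng = np.random.default_rng(seed)
    p, l = p0.copy(), l0.copy()
    trace = []
    for t in range(1, T + 1):
        p_hat = p.copy()                                   # Prediction(P(t-1)): reuse last observation
        l = optimize(p_hat, l)                             # L(t) = Optimize(P~(t))
        gamma = rng.uniform(-r_walk, r_walk, size=p.shape)
        p = np.clip(p + gamma, 0.0, None)                  # random-walk update (clipping at zero is our addition)
        trace.append(avg_response_time(l, p, s, c, tau))   # O(t)
    return trace
```

The next section describes the developed optimization algorithms in detail.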
13.3 Optimization Algorithms

13.3.1 Greedy Algorithm

The 'greedy' principle consists of selfishly allocating resources (provided the constraints allow it) without regard to the performance of the other members of the network (Cormen et al., 1990). While greedy algorithms are optimal for certain problems (e.g. the minimal spanning tree problem), in practice they often produce only near-optimal solutions. Greedy algorithms, however, are very fast and are usually used as a heuristic method. The greedy approach seems well suited to our problem, since the uncertainties in the file request prediction mean that we never actually optimize the real problem, but only our expectation of it.

The implemented greedy algorithm works as follows. For each file j, check every node i to see if there is enough space to accommodate it; if there is, calculate the response time of the network if file j were placed at node i:

\sum_{k=1}^{N} \left( \frac{s_j}{c_{k,i}} + r_i \right) \cdot p_{k,j}

After all nodes are checked, copy the file to the best node found.

The algorithm described above loads only one copy of each file into the distributed file system. If multiple copies are allowed, copies of the files are then added in the following way: for each node i, get the most heavily used file (i.e. the file maximizing p_{i,j} \cdot s_j over j) which is not already present locally; check if there is enough space to accommodate it, and if so, copy it. Continue until all files are checked. (A code sketch of the single-copy pass is given below.)
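The single-copy greedy pass might look roughly as follows. This is a sketch under our own assumptions: free[] tracks remaining disk capacity, r[i] is the current waiting time at node i taken as given, and all names are hypothetical rather than the chapter's implementation.

```python
def greedy_single_copy(p, s, c, r, d):
    """Place exactly one copy of each file at the node that minimizes the
    response-time contribution sum_k (s[j] / c[k][i] + r[i]) * p[k][j].
    p[k][j]: request frequency, s[j]: file size, c[k][i]: link capacity,
    r[i]: waiting time at node i, d[i]: disk capacity of node i."""
    N, M = len(d), len(s)
    free = list(d)                    # remaining capacity per node
    location = [None] * M             # chosen node for each file
    for j in range(M):
        best_node, best_cost = None, float("inf")
        for i in range(N):
            if free[i] < s[j]:
                continue              # file j does not fit on node i
            cost = sum((s[j] / c[k][i] + r[i]) * p[k][j] for k in range(N))
            if cost < best_cost:
                best_node, best_cost = i, cost
        if best_node is None:
            raise ValueError(f"no node has room for file {j}")
        free[best_node] -= s[j]
        location[j] = best_node
    return location
```

The replication pass described above would then walk over the nodes and copy in each node's most heavily requested missing file while disk space remains.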
13.3.2 Genetic Algorithm

Genetic Algorithms (GAs) are very popular due to their simple idea and wide applicability (Holland, 1975; Goldberg, 1989). The simple GA is a population-based search in which the individuals (each representing a point in the search space) exchange information (i.e. reproduce) to move through the search space. The exchange of information is done through operators (such as mutation and crossover) and is based on the 'survival of the fittest' principle, i.e. better individuals have a greater chance to reproduce. It is well established that, in order to produce good results, the basic GA must be tailored to the problem at hand by designing a problem-specific representation and operators. The overall flow of control in our implemented GA for the distributed file system model is similar to the steady-state genetic algorithm described in Chapter 1. In order to describe the further implementation details of our GA we need to answer the following questions: How are the individuals represented? How is the population initialized? How is the selection process implemented? What operators are used?

Individual representation: each individual in the population represents a distribution state of the file system, captured by the matrix {l_{i,j}}.

Initialization: it is important to create a random population of feasible individuals. The implemented initialization process randomly generates a node index for each object and tries to accommodate it on that node. In case of failure the process is repeated for the same object.

Selection process: the individuals are first linearly ranked according to their fitness and then selected by a roulette-wheel process using their rank value.

Operators: the main problem is the design of operators which preserve the feasibility of solutions (Bilchev and Parmee, 1996). This is important for our problem since it intrinsically has two constraints: (i) the disk capacity of the nodes must not be exceeded; and (ii) each file must have at least one copy somewhere on the network. (If feasibility is not preserved by the operators, then the fitness function would have to be modified by an appropriate penalty function in order to drive the population into the feasible region.) We have developed two operators, both preserving feasibility, called Safe-add and Safe-delete (sketched in code at the end of this subsection).

Safe-add works as follows: for each node, randomly select and copy a file which is not already locally present and whose size is smaller than the available disk space. Then check whether any of the nodes would respond faster by downloading files from the new locations and, if so, update the matrix {l_{i,j}}.

Safe-delete works as follows: for each node, randomly select and delete a file provided it is not the last copy. Update the matrix {l_{i,j}} to reflect the changes.

In our experiments we have used a population size of 70 individuals evolved for 30 generations. During each generation 50 safe-add and three safe-delete operators are applied. During the selection process the best individual has a 5% greater chance of being selected than the second best, and so on.
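A minimal sketch of the two feasibility-preserving operators. As an assumption of our own, an individual is represented here as the set of files stored on each node plus its free capacity; the re-pointing of the index matrix {l_{i,j}} that the chapter performs after each change is only indicated in comments, and all names are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    files: list   # files[i] = set of file ids stored on node i
    free: list    # free[i] = remaining disk capacity of node i

def replica_count(ind, j):
    """Number of nodes currently holding a copy of file j."""
    return sum(j in stored for stored in ind.files)

def safe_add(ind, s, rng=random):
    """For each node, copy one random file that is not already present locally
    and that fits in the remaining disk space (constraint (i) preserved)."""
    for i, stored in enumerate(ind.files):
        candidates = [j for j in range(len(s))
                      if j not in stored and s[j] <= ind.free[i]]
        if candidates:
            j = rng.choice(candidates)
            stored.add(j)
            ind.free[i] -= s[j]
    # The chapter then re-points any file manager that would respond faster by
    # downloading from one of the new replicas, i.e. updates {l_{i,j}}.

def safe_delete(ind, s, rng=random):
    """For each node, delete one random file provided it is not the last copy
    anywhere on the network (constraint (ii) preserved)."""
    for i, stored in enumerate(ind.files):
        deletable = [j for j in stored if replica_count(ind, j) > 1]
        if deletable:
            j = rng.choice(deletable)
            stored.remove(j)
            ind.free[i] += s[j]
    # Managers whose l[i][j] pointed at a deleted replica must be re-pointed
    # to one of the surviving copies.
```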
13.4 Simulation Results

In our experiments we start with an offline simulation, during which the optimization algorithms are run when the file system is not in use (overnight, for example). In this scenario, we assume that both algorithms have enough time to finish their optimization before the file system is used again. A typical simulation is shown in Figure 13.4. Tests are done using seven nodes and 100 files. All simulation graphs start from the same initial state of the distributed file system. The two optimization algorithms are then compared against a non-adaptive (static) file system (i.e. one in which no optimization is used). The experiments clearly show that the adaptive distributed file system produces better results than a static file system. The graphs also clearly indicate the excellent performance of the GA optimizer, which consistently outperforms the greedy algorithm.

Figure 13.4 An offline simulation. Adaptive distributed file systems utilizing a genetic algorithm and a greedy algorithm respectively are compared against a static distributed file system. The experiments use seven nodes and 31 files.

To show the effect of delayed information, we run the greedy algorithm once using the usage pattern collected from the previous simulation step, P(t-1) (which is available in practice), and once using the actual P(t) (which is not known in practice). The difference in performance reveals how much better we could do if perfect information were available (Figure 13.5).
Figure 13.5 A greedy algorithm with perfect information is compared to a greedy algorithm with delayed information. In practice, we only have delayed information.

Figure 13.6 Online simulation. The circles indicate when the GA optimization takes place. The GA/greedy algorithm ratio is 60 (i.e. the GA is run once for every 60 runs of the greedy algorithm).
Figure 13.7 Online simulation. The GA/greedy algorithm ratio is 40. This is the critical ratio at which the average performance of both algorithms is comparable.

Figure 13.8 Online simulation. The GA/greedy algorithm ratio is 30. The GA manages to maintain its performance advantage.
Next we consider online simulations. In an online simulation a faster optimization algorithm is executed at a greater rate than a slower one. To account for the fact that the GA is much slower than the greedy algorithm, we introduce a new simulation variable, the GA/greedy algorithm ratio. It gives the number of greedy optimizations that can be completed in the time it takes one GA optimization to finish.

Figure 13.6 shows a typical online simulation. Here the GA is run every 60 steps while the greedy algorithm is run every step. The GA optimizes the file system at the beginning of the simulation; then, for the next 60 steps, its file allocation remains static. This allows the greedy algorithm to take over and outperform the GA. It is interesting to note the critical GA/greedy algorithm ratio above which the greedy algorithm outperforms the GA. For our simulation model this happens to be around 40; Figure 13.7 shows such a simulation. When the ratio drops even further, the GA is capable of maintaining its performance advantage (Figure 13.8). The simulation results suggest that the GA is more suitable for situations where the file request pattern does not change very quickly, while the greedy algorithm is more appropriate for rapidly changing request patterns.

13.5 Conclusions and Future Work

Our investigations showed that adaptation is crucial for network performance optimization. Its importance would be even more evident in a real network with limited resources and slow communication links. The simulation results confirmed our initial hypothesis that a tailored genetic algorithm would outperform 'classical' heuristics such as the greedy algorithm. However, while it is able to find better solutions, the GA is considerably slower and does not scale as well as the greedy algorithm.

That said, it is important to mention that both algorithms are centralized: they run on a single node and use global information. Centralized algorithms in general suffer from scalability problems, and operations such as collecting global information can be prohibitive for large networks. This suggests a direction for future study, namely distributed adaptation algorithms in which only local information is used and optimization decisions are executed locally on each node. While it is difficult to predict how global network behavior emerges from the many local interactions, such an algorithm, once designed, would be intrinsically scalable.

Another important aspect requiring future study is network resilience. User requests (file requests in our model) should not fail if a node is temporarily disconnected. While designing resilient networks is a huge problem in itself, it is even more important to design resilient networks that are also optimal and adaptable.

Acknowledgements

We would like to thank Brian Turton (University of Wales), Dave Corne (Reading University), Marc Wennink (BT) and Alan Pengelly (BT) for useful comments.