Customizing the minimum number of replicas for achieving fault tolerance in a cloud/grid environment

ABSTRACT


INTRODUCTION
The utilization of grid computing enables organizations to integrate geographically dispersed resources from different administrative domains into a unified system. This consolidation facilitates the resolution of large-scale problems across scientific, human, and social domains [1]-[5]. These resources encompass diverse components such as computers, storage, peripherals, and applications. Because network resources are heterogeneous and change dynamically, a middleware layer must be implemented to deliver essential services to users.
Network computing environments are prone to various failures and outages, primarily attributed to the abnormal characteristics of the network infrastructure. Failures can manifest in different forms [6], [7], including computer failures, connection issues, network bottlenecks, excessive power consumption, software malfunctions under workload, accidental software deletions, and other factors stemming from the diversity of networks and applications. Ensuring fault tolerance is crucial for maintaining continuous network operation and delivering reliable services.
In essence, reliable applications within the network should be capable of automatically mitigating failures with minimal losses, without significantly compromising performance and quality of service (QoS) [8], [9]. Put simply, the network should possess the ability to minimize and overcome failures, allowing uninterrupted functionality. Fault tolerance empowers the network to continue operating even in the presence of significant faults or failures, without disrupting overall functionality. Failure handling may also be built into resource-scheduling strategies: if it takes place prior to scheduling resources, the approach is called proactive; otherwise, it is called post-active [10], [11]. The post-active approach is easier to implement than the proactive one because it uses job-monitoring techniques, whereas the proactive method works through probabilities and therefore requires more information about network resources. If the method used is proactive, such as replication, then all decisions for handling failure must be made before the task is started, thereby reducing the probability of failure and increasing productivity. Table 1 illustrates some examples of fault tolerance.

In grids, the proposed fault-tolerance mechanisms are divided into three categories [12]. The first category is task replication [9], [13]: the same task is executed on many independent resources to protect it from a single point of failure. The second is checkpoint redundancy [14]: the state of the running job is saved to persistent storage at different points, so that if an error occurs, execution is resumed from the last stored point rather than from the beginning. The last category is adaptive, in which both checkpoints and replicas are used to complete the task. The adaptive approach greatly improves the performance of a distributed system, achieving throughput and fault tolerance with optimal parameters. The first category is considered in this paper, in order to establish a failure-resistant proactive scheduling system and specify the minimum number of replicas needed for each task.

Related work
Detecting faults and predicting them before they occur is a primary goal during system design, as is preventing or avoiding the errors that cause problems such as deadlock, together with recovery strategies. These strategies are implemented through replication, through some form of improvement, or both. Failure at one resource is more likely than simultaneous failure at multiple resources. Therefore, re-executing jobs from the beginning can be avoided by using replication: a failure in one of these resources does not lead to a network breakdown, the network can continue to provide services, and no effort or time is wasted in the progression of the work.
As work on a task progresses, the system stores all the information and data for the task at that point, and continues to back up the task state at each new development according to a specific strategy. Such a point is called a checkpoint. If the work fails somewhere, the system retrieves the stored information and resumes from that point rather than from the beginning, which saves effort and time. The main problem here is the possibility of checkpointing erroneously, repeatedly, or continuously, which causes a bigger problem. In this research, the replication mechanism is used while limiting the number of resources exploited for replicas to the lowest possible extent, in order to save time and reduce system expenses.
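As an illustration of the checkpoint mechanism described above, a minimal sketch follows; the JSON storage format and the per-step granularity are assumptions for illustration, not the paper's design.

```python
# Minimal checkpoint/resume sketch: progress is persisted after each unit of
# work, so a restarted run resumes from the last stored point rather than
# from the beginning. Storage format and step granularity are illustrative.
import json
import os

def run_with_checkpoints(steps, ckpt_path="task.ckpt"):
    """Run a list of work units, persisting progress after each one."""
    start = 0
    if os.path.exists(ckpt_path):          # a previous run failed: resume,
        with open(ckpt_path) as f:         # do not restart from the beginning
            start = json.load(f)["next_step"]
    for i in range(start, len(steps)):
        steps[i]()                         # execute one unit of work
        with open(ckpt_path, "w") as f:    # checkpoint: save task progress
            json.dump({"next_step": i + 1}, f)
    os.remove(ckpt_path)                   # task finished: clear the checkpoint
```

If the process dies mid-run, the next invocation reads the stored `next_step` and skips the already-completed work.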
Abawajy [15] suggested a distributed scheduling algorithm that combines both scheduling and replication functionality. The idea is to divide the network into a group of small networks, each consisting of a group of sites, with a scheduler at each site. Each scheduling manager, in turn, backs up another manager's schedule. This algorithm provides each user with a fixed number of replicas.
Srinivasa et al. [16] propose a middleware that keeps a copy of the task at every node in the network, with communication between these nodes taking place over TCP/IP. Jiang and Zhou [17] suggested a fault-tolerant job-scheduling algorithm that matches the resource's trust level to the user's security requirements; the number of copies is determined according to the security level of the network, which is variable. Chtepen et al. [18] introduced a scheduling heuristic that relies on replicating jobs and rescheduling unsuccessful tasks using real-time network state information rather than relying on scheduled job data.
The aim of the proposed work
The network resources are dynamic and heterogeneous, so the possibility of failures in the network environment is high, and more time is therefore required to carry out the submitted jobs [19], which can cause the network to fail. In the event of the failure of the available resources, the system consumes additional time searching for other resources suitable for carrying out these tasks. Most replication-based algorithms use a fixed number of identical copies [20], which means excessive use of resources for the same task and leaves the network busy while only a minimum number of tasks is performed. In this work, we suggest an algorithm in which the number of replicas is not fixed for all submitted jobs but varies according to the type of task to be performed.

METHOD
Backup resources selection
The backup resources are selected once the replicas needed for the jobs have been determined. Consequently, even if one resource fails, the network can still complete the job by utilizing a surrogate resource. In this paper, the selection of these resources is based on factors such as response time and resource load. The resource information server (RIS) is responsible for storing the historical data of resource loads, defined as (1):

LH_j = (I_j × S_j) / 10^6    (1)

where LH_j is the load history, I_j is the number of instructions completed by resource j, and S_j is the speed of resource j, measured in seconds per million instruction operations.
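As a minimal sketch of the load-history computation in (1), assuming the load is the completed instruction count multiplied by the per-million-instruction time; the function and variable names are illustrative, since the paper's original notation is not preserved here.

```python
# Sketch of (1): load history as accumulated busy time of a resource.
# instructions_completed is I_j; secs_per_million_instr is S_j.
def load_history(instructions_completed: float, secs_per_million_instr: float) -> float:
    """Estimate the load history LH_j of resource j, in seconds."""
    return instructions_completed * secs_per_million_instr / 1_000_000

# A resource that completed 5 billion instructions at 0.002 s per million
# instructions (i.e., 500 MIPS) has accumulated 10 s of load.
lh = load_history(5_000_000_000, 0.002)  # -> 10.0
```

The RIS would store one such value per resource and update it as jobs complete.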

Grid scheduling system
The architecture of the grid scheduler (GS) assumes a group of units. The user interface enables users to send their tasks to the network; the received tasks are then directed to the available network resources [21]-[25]. One of these resources, the RIS, collects information about the capacity of the other resources, such as memory size and central processing unit details. Based on this information, the GS takes the appropriate scheduling decision. The GS also provides a reliable fault-tolerance service through another component, the fault handler (FH), which handles the failure cases that may occur in the system. As shown in Figure 1, the assumed scheduling of resources spread across many different geographical regions is managed by a central management unit. It is well known that resources differ in their behavior towards failure, so there must be a processor able to deal with errors when they occur. If a result is outside expectations, a failure has happened, and information about this failure is stored in the RIS to benefit from it when implementing the next task.

Proposed system
By applying proactive scheduling, the proposed system tries to avoid failures and, when failures do occur, to reduce their effect. This is done by creating many replicas on different resources and executing them at the same time. Therefore, if one of the resources fails, this does not affect the execution of the task on the remaining resources, and execution continues without delay. Upon the completion of any replica, all other copies are terminated and their network resources are released. The minimum number of replicas needed for each task is determined by the system based on knowledge of the inclination of those resources to fail. As a result, the system can reduce the effects of failure on the network.

The system then chooses a good group of resources on which to carry out the task, based on the response time of these resources, which is the sum of the time needed to transfer the task from the scheduler to the resource, the waiting time in the queue, the execution time, and the time to transfer the result from the resource back to the scheduler. The job scheduling process involves selecting a job from the job queue, considering the quality of service desired by the user. The resource information server is then consulted to obtain a list of resources that meet the user's QoS criteria [26]. The role of the RIS is to provide a resource list along with the estimated response times for task completion. Subsequently, the scheduler sorts this list by the response time of each server, and the highest-ranked server is selected as the primary server for executing the task. However, this primary resource may still fail to carry out the task. To address this, during the replication phase the system elects certain resources from the available list to execute duplicates of the task; these are known as backup or reserve resources. Performance can be improved in two ways:
a. By carrying out a job on more than one resource at a time. The response time is the time to complete the task on the first resource to finish; it cannot be fixed, since it varies with the load on and requests to the server at the time, as well as with the condition of the network and the capacity of the server. Executing the task on multiple resources may therefore improve the system's response time.
b. Replicated execution helps deal with failure. It is sufficient for a single replica to finish for the task to be complete, so the effect of a failure is reduced compared with execution on only one resource.

Determining the number of replicas is very important. Increasing the number may significantly reduce the chance of the task failing, but at the same time it consumes system resources and increases the response time. On the other hand, an inappropriately small number of replicas may cause the task not to be executed at all. The number of replicas must therefore balance both considerations, so that the probability of executing the task is high while the effect on the stability of the network is as small as possible.
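The response-time ranking and backup selection described above can be sketched as follows; all names and timing values are illustrative assumptions, not the paper's data.

```python
# Response time per the text: transfer to the resource + queue wait +
# execution + transfer of the result back to the scheduler (all in seconds).
def response_time(transfer_to, queue_wait, execution, transfer_back):
    return transfer_to + queue_wait + execution + transfer_back

# Hypothetical per-resource timing estimates supplied by the RIS.
candidates = {
    "R1": response_time(0.4, 1.2, 5.0, 0.3),   # 6.9 s
    "R2": response_time(0.2, 0.5, 6.5, 0.2),   # 7.4 s
    "R3": response_time(0.1, 0.3, 4.8, 0.1),   # 5.3 s
}

# The best-ranked (fastest) resource becomes the primary server; the rest of
# the sorted list supplies the backup (replica) resources.
ranked = sorted(candidates, key=candidates.get)
primary, backups = ranked[0], ranked[1:]
```

Running replicas on `backups` in parallel means the effective response time is that of whichever resource finishes first.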

Adaptive job replication
The proposed algorithm determines the optimal number of replicas, which is not constant for all tasks but fits the inclination of the resources to fail. This relationship is direct: the higher the number of resource failures, the greater the need for replicas, and conversely, the fewer the failures, the fewer replicas are needed. Consequently, the optimal number of replicas can be deduced from the number of times the resources have failed, which can be calculated from their history. This number varies according to the type of job assigned to each resource.
Let F_j represent the count of failures of resource j in executing its assigned tasks, and let C_j denote the count of its successful completions. Whenever the resource fails to accomplish its mission, F_j increments by one and the task originally assigned to this resource is reassigned to another appropriate resource within the network. Conversely, if the resource completes its mission, C_j increases by one. Thus, the inclination to fail FT_j of resource j can be represented as (2):

FT_j = F_j / (F_j + C_j)    (2)

Thus, the probability of successful execution of the mission on resource j, B_j, can be written as (3):

B_j = 1 - FT_j    (3)

Assuming that the resources R1, R2, ..., RN are dedicated to task t, the average inclination rate for failure over these resources, FT_n, is given by (4):

FT_n = (1/N) Σ_{j=1}^{N} FT_j    (4)

The number of replicas of the task, k, is chosen to be commensurate with the value of FT_n. The minimum number of replicas is one, and since the number of resources available and suitable for the task does not exceed N, the upper limit on the number of replicas is N. Thus, the probability of successful execution of the mission across all k replicas combined, B_t, can be written as (5):

B_t = 1 - Π_{j=1}^{k} FT_j    (5)
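A minimal numeric sketch of these quantities, assuming the failure tendency is the failure count over total attempts and the joint success is one minus the product of the individual failure tendencies; all variable names and history values are illustrative.

```python
# Per-resource failure tendency and success probability, as in (2)-(3).
def failure_tendency(failures: int, successes: int) -> float:
    return failures / (failures + successes)            # (2)

def success_probability(failures: int, successes: int) -> float:
    return 1.0 - failure_tendency(failures, successes)  # (3)

# Hypothetical (F_j, C_j) history for three resources dedicated to a task.
history = [(2, 48), (5, 45), (1, 99)]
ft = [failure_tendency(f, c) for f, c in history]       # [0.04, 0.1, 0.01]
ft_avg = sum(ft) / len(ft)                              # (4): average tendency

# (5): the task fails only if every replica fails simultaneously.
joint_fail = 1.0
for p in ft:
    joint_fail *= p
joint_success = 1.0 - joint_fail
```

With these numbers the joint success probability is 0.99996, illustrating why a few replicas on mostly reliable resources already drive the failure probability toward zero.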

The algorithm
Algorithm 1, named the optimal resource allocation and backup strategy algorithm (ORABS), is employed to decide the number of replicas for every submitted task. It compares the values of B_j and B_n, starting with the first resource on the dedicated list: if B_j ≥ B_n, an additional replica is added and the accumulated value is updated as B_t = B_t + B_j; the algorithm stops when B_t > B_j. The steps of the algorithm are as follows:

Algorithm 1: ORABS
For each task submitted by the user:
1. Receive the task from the user.
2. Request FT for all resources from the fault handler.
3. Request LH for all resources from the resource information server.
4. Calculate FT_j, B_j, FT_n, and B_n for all resources in the grid; if B_j > B_n, add resource j to the list, otherwise take no action.
5. Send packets to the list of servers from step 4 to calculate the RTT.
6. Obtain from the resource information server the list of servers with the best response times.
7. Arrange the resources in ascending order according to their response times.
8. Calculate the average probability of success of the resources.
9. Count the number of replicas for the task.
10. Define the backup resources.
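The steps above can be sketched as follows; the helper names, data shapes, and the stopping threshold on the joint failure probability are assumptions for illustration, not taken from the paper.

```python
# A runnable sketch of the ORABS pipeline: filter by success probability,
# filter and rank by RTT, then grow the replica set.
def orabs(resources, rtt_probe):
    """resources: {name: {"B": success probability}}; rtt_probe: name -> RTT (ms)."""
    # Steps 2-4: keep resources whose success probability beats the average.
    b_avg = sum(r["B"] for r in resources.values()) / len(resources)
    shortlist = [n for n, r in resources.items() if r["B"] > b_avg]
    # Steps 5-7: probe RTT, keep resources at or below the average RTT,
    # and rank the survivors in ascending order of RTT.
    rtts = {n: rtt_probe(n) for n in shortlist}
    rtt_avg = sum(rtts.values()) / len(rtts)
    fast = sorted((n for n in shortlist if rtts[n] <= rtt_avg), key=rtts.get)
    # Steps 8-10: grow the replica set until the joint failure probability
    # drops below an assumed threshold (or the candidates run out).
    replicas, joint_fail = [], 1.0
    for n in fast:
        replicas.append(n)
        joint_fail *= 1.0 - resources[n]["B"]
        if joint_fail < 1e-4:
            break
    return replicas
```

The returned list contains the primary resource first, followed by the backup resources on which the duplicates run.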

CASE STUDY
Assume that we have 50 resources, R1, R2, ..., R50. First, calculate the LH using (1) for each server and discard the servers whose load is above medium (CPU utilization ≤ 66% and memory utilization ≤ 62%, as in [24]). Second, calculate FT_j, B_j, FT_n, and B_n for all resources in the grid, such that if B_j > B_n, resource j is added to the list and otherwise no action is taken; apply this procedure to all 50 resources (if none were discarded in the first step, otherwise only to those remaining on the list). Third, suppose the number of resources remaining in the list is 20, R1, R2, ..., R20; send a ping to all resources in the list and calculate the RTT. Table 2 represents the ping values of these 20 servers measured in milliseconds. The total RTT is 3317 ms for all resources, so the average is 3317/20 = 165.85 ms. Fourth, exclude all resources whose value is higher than the average. The result is shown in Table 3.
Table 2. RTT in ms for 20 servers

The tendency to failure is calculated using (2) for all the remaining resources; suppose the tendency to failure of these resources is as in Table 4. The success probability of each of these resources is then calculated using (3) and presented in Table 5. Next, calculate the average probability of success for them: the total is 8.21 over the remaining resources, so the average is 8.21/9 = 0.91. Exclude all resources with a value lower than the average and arrange the table in ascending order. The resources in Table 6 therefore represent those we can use for the purpose of replication. Note that each of the resources shown in Table 6 has a high probability of successfully implementing the task. Accordingly, only three resources are sufficient for replication, and the first three are the best of these resources, owing to their high possibility of carrying out the task. The possibility of achieving the task increases as resources are added, but this option is not good, because it causes resource consumption, network failure, and instability of the network [27]. Figure 2 represents the response time of all resources before any of them are excluded, while Figure 3 represents those with a higher response time.
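The case-study filter can be expressed as simple arithmetic; the RTT values in Table 2 are not reproduced here, so the list below is hypothetical, while the averaging and exclusion rule follow the case study (total 3317 ms over 20 servers gives an average of 165.85 ms).

```python
# Average-based RTT filtering, as in the case study.
avg_20 = 3317 / 20  # the paper's figures: 165.85 ms over 20 servers

# Hypothetical RTT measurements (ms) for a smaller illustrative list.
rtts = {"R1": 90, "R2": 210, "R3": 140, "R4": 400, "R5": 120}
avg = sum(rtts.values()) / len(rtts)                  # 192 ms
kept = {n: t for n, t in rtts.items() if t <= avg}    # drop above-average RTTs
```

Only the servers at or below the average RTT stay on the candidate list, mirroring the step that produces Table 3.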

RESULTS AND DISCUSSION
Independent events are occurrences that do not influence each other. When event Q is independent of event K, the probability of Q happening is unaffected by the occurrence of K. If Q and K are independent events in a random experiment, the probability of both events occurring simultaneously, denoted P(Q⋂K), is the product of their individual probabilities P(Q) and P(K), as in (6):

P(Q⋂K) = P(Q) × P(K)    (6)

In the case of multiple independent events Q1, Q2, ..., Qn associated with a random experiment, the probability of all these events happening simultaneously, P(Q1⋂Q2⋂Q3⋯⋂Qn), is the product of the individual probabilities of each event, as in (7):

P(Q1⋂Q2⋂Q3⋯⋂Qn) = P(Q1) × P(Q2) × P(Q3) × ... × P(Qn)    (7)

So, when the first two servers work together, the probability of both failing is 0.0016 by (6): the failure probability of each is 0.04 (a success probability of 0.96 for each), so their joint failure rate is 0.04 × 0.04 = 0.0016. If the first three servers work together, the probability that all three fail is 0.04 × 0.04 × 0.05 = 0.00008 by (7). As the number of servers increases, the probability of failure approaches zero: the failure rate with only two servers was 0.0016; adding a third server decreased it to 0.00008; with four servers it is 0.000004; and with five and six servers it is 0.00000028 and 0.0000000252, respectively. Figure 4 represents the failure probability of the joint servers, while Figure 5 represents the probability of their success. Applying (5), we can deduce the probability of successful execution of the task across all servers. Figure 5 shows that the best case is to choose only three servers for execution, since the probability that at least one of them succeeds is very high and close to certain execution of the task. Executing with this minimum number of resources therefore avoids consuming system resources while ensuring a high probability of task execution.
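The worked numbers above can be reproduced directly. The first two failure probabilities (0.04) and the third (0.05) are stated in the text; the remaining per-server values below (0.05, 0.07, 0.09) are implied by the stated joint probabilities and are therefore an assumption.

```python
# Joint failure probability of independent servers, per (6)-(7):
# multiply the individual failure probabilities.
fail_probs = [0.04, 0.04, 0.05, 0.05, 0.07, 0.09]

def joint_failure(ps):
    out = 1.0
    for p in ps:
        out *= p
    return out

two = joint_failure(fail_probs[:2])    # 0.04 * 0.04        = 0.0016
three = joint_failure(fail_probs[:3])  # 0.04 * 0.04 * 0.05 = 0.00008
four = joint_failure(fail_probs[:4])   #                    = 0.000004
```

Each added server multiplies the joint failure probability by a number well below one, which is why the curve in Figure 4 drops so quickly toward zero.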

CONCLUSION
To avoid errors and the re-execution of tasks, which cause loss of time and effort, most existing methods use a fixed number of resources without taking into account the type and size of the task. Although the number of resources used is large, the choice is ill-considered and the results of the execution are highly uncertain. In this paper, we propose a fault-tolerant approach for scheduling jobs in cloud/grid computing. The approach deliberately selects a specific, minimal number of resources based on their known response times. This selection reduces the number of resources required to execute tasks while raising the likelihood of task completion to near certainty, minimizing the impact on network resources, and ensuring uninterrupted performance. Executing tasks on carefully chosen resources with known response times thus optimizes performance, reduces resource utilization, facilitates seamless task completion with minimal disruption, and maintains network efficiency.

Figure 1. Proposed system architecture
Bulletin of Electr Eng & Inf, ISSN: 2302-9285 (Mahdi S. Almhanna)

Figure 2. Round trip time for all resources

Figure 3.

Table 1. Examples of fault tolerance

Table 3. RTT for remaining servers

Table 4. FT for 20 servers