Scheduling techniques in Hadoop and Spark for smart city environments: a systematic review

ABSTRACT


INTRODUCTION
In today's world, data growth is reaching new highs: recent statistics suggest that by the year 2050, almost 70% of the world's population will live in cities. For this very reason, the development of smart cities is extremely necessary, as they make it possible to provide smart, efficient, and enhanced solutions built on an intelligent infrastructure. As towns are converted into smart domains and other forms of modern technology advance, the smart city (SC) is gaining much attention; it is now seen as a new paradigm of intelligent city development and sustainable socio-economic growth [1], [2]. To enhance the quality of life, the smart city proposes a novel approach to the design and operation of urban infrastructure, including infrastructure for housing, transportation, public services, utilities, health care, and more.
Smart cities are those in which human capital and information and communication technology investments lead to long-term economic development and a good quality of life [3]. Cities are therefore central to tackling significant public and financial challenges, such as low-carbon expansion, emission reduction, energy efficiency, shared energy resources, economic development, and more [4]. The motivation for moving to smart cities is that they can provide services on a citizen-demand basis, so that organizations and businesses respond to citizens' needs more effectively. Two basic requirements for attaining these personalized services are the ability to understand a user's current needs and the ability to adapt to subsequent changes in the user's behavior. Data analysis serves this purpose; however, the precise timing and placement of this analysis are highly crucial. This smart setup becomes a reality by appropriately utilizing internet-embedded devices, including sensors and electronics capable of communicating with each other via a network. These devices, however, generate a massive amount of heterogeneous data, known as big data [5]-[7].
The number of data-producing devices has expanded dramatically throughout the globe, and smart cities are among their primary sources [8]. The world's information output has skyrocketed, leading to a phenomenon known as big data: a term for very massive and complicated datasets that cannot be processed using conventional methods [9]. Such extensive datasets are a significant barrier to traditional data processing methods. Google launched one of the first practical frameworks for processing massive data, MapReduce, in 2004 [10], [11]; it is scalable, dependable, and has excellent fault tolerance. In addition, Apache Hadoop is a free, open-source software framework that has dominated big data analysis due to its strengths in many areas: utilization of all available hardware resources, from a single server to thousands of servers; parallel processing of huge amounts of data; fault tolerance; and network load balancing. Companies such as Google, Facebook, and Amazon hold vast amounts of data that require processing to filter out the valuable parts. Handling the massive amount of data from smart cities is a central problem in current computing, since the conventional boundaries of the smart city have expanded, allowing emergency-event prediction and real-time management that were previously impossible. Because competent resources are so crucial in the aftermath of an incident, the effectiveness with which they are allocated and scheduled is a critical indicator of any response capability [12], [13]. Many researchers are working on ways to handle this big data efficiently. This paper's significant contribution is to examine scheduling techniques in Hadoop and Spark that may be applied in a smart city environment. This review will
fulfill the following objectives: i) provide an overview of smart cities, including their significance and benefits; ii) discuss and analyze the challenges of processing the massive amounts of data smart cities generate; iii) provide a detailed comparison of Hadoop and Spark scheduling techniques for big data analysis; and iv) identify research gaps in current data processing techniques, future research directions, and open research issues in real-time big data scheduling. The scientific significance of this review is that it will help researchers understand the need to develop algorithms and techniques that support the prosperity of smart cities and similar systems, eventually leading to the betterment of humanity.
The rest of the paper is organized as follows: section 2 explains the smart city and a few of its characteristics; section 3 presents techniques for processing real-time data; Spark and its comparison with Hadoop are discussed in section 4; finally, the paper is concluded along with future recommendations in section 5.

SMART CITY
A smart city should be able to optimize the utilization of all of its assets in real time, both material (such as transportation systems, energy distribution networks, and natural resources) and immaterial (such as human capital, the intellectual capital of companies, and organizational capital in public administration bodies). Flood, fire, and earthquake emergency rescue and disaster relief, anti-terrorism, and remote control of hazardous areas are some of the many potential uses [14]. In contrast to renewable resources (such as solar, wind, and geothermal energy), nonrenewable resources (such as petroleum) will be exhausted over time because of depletion. In recent decades, experts have promoted the ideas of smart energy [15], green energy [16], and sustainable energy [17] to raise awareness of these challenges and develop best practices for energy usage. A smart city has several characteristics, including the transfer of technological, infrastructural, and managerial procedures from rural to urban settings.

Characteristics of smart cities
Specific characteristics, keynotes, and organizational frameworks characterize smart cities; the idea behind this theme is the foundation of a modern, technologically advanced metropolis. A few smart city services are given in Figure 1. The figure highlights various features of a smart city, including the education system, health system, daily utility management, smart transportation, the government sector, and the public sector. The explanations below elaborate on these features, showcasing how technology and data-driven solutions enhance urban life. The concept of sustainability has maintained its prominence throughout the development of the smart city [18]-[20]. Preserving energy and natural resources is critical for a smart city to function sustainably [21]-[23]. In the early days of the smart city movement, enhancing residents' comfort was a primary focus, and various cities throughout the world ran trials to address these issues. Intelligent lighting has been the focus of specific studies: citizens may adjust the brightness of ten thousand sensor-equipped streetlights to suit their needs, with the goal of reducing power consumption by approximately 70% [24].
Smart energy is increasingly appealing since it promotes an all-encompassing approach to coordinating environmentally friendly, maintainable, and sustainable power sources. The goal of eco-friendly energy is to use fuel with minimal environmental impact and the fewest negative natural consequences. An alternative energy source that does not deplete the planet's resources over time is the best option for meeting the world's energy needs, and the increased focus on energy needs has led to a rise in the popularity of renewable sources. Much research is ongoing to integrate renewable energy sources into intelligent buildings: smart buildings may use renewable energy directly, or the existing infrastructure can incorporate renewable energy plants. There is a proposal for a microgrid control framework that integrates a photovoltaic (PV) power source with a significant energy storage unit [25]. Similarly, Jia et al. [26] propose combining solar and wind power to decrease dependency on critical energy resources.

Smart transportation
Accessibility at regional and international levels and the availability of cutting-edge, environmentally friendly transportation technologies all fall under the term smart transportation [27], [28]. The need for reliable modes of transportation dates back to the dawn of civilization, and as technology has progressed, all modes of transportation, including land, sea, rail, and air, must follow the same stipulation. Neither the world's traditional transportation strategy nor its components were linked or interlinked. A cutting-edge linked system has replaced the conventional transportation system due to the concept of everyday interfacing devices, so modern automobiles are now part of various communication and routing frameworks. All the automobiles that participate in a particular transmission are linked together, and several standalone transporters are connected to form a global transportation system by increasing the connections inside a single transporter. Intelligent transportation systems (ITS) have given much thought to the vehicular ad hoc network (VANET) [29]. VANETs widely use vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communication to manage road traffic. Using the new transportation framework's metrics to ensure the metropolitan area's viability comes at the expense of the residents' happiness [30].

Smart healthcare
The present healthcare system is struggling to keep up with the demands of a rapidly expanding population, and the issue worsens because medical staff numbers have not increased with population growth. As a result, the gap between healthcare expectations and delivery widens due to a lack of resources and high demand. To meet the need and improve the quality of administration, current innovative health services use sensor networks, ICT, cloud and fog computing, cell phone applications, and powerful data handling systems [31]. Integrating electronic clinical records (ECRs) further allows for timely decisions with the most up-to-date information [32]. Another method of achieving satisfactory mobile health in metropolitan areas was given by [33]. Rapid urbanization and increased manufacturing have contributed to a rise in waste production. Effective waste management is possible via the cooperation of the workforce, municipal authorities, and private businesses [34]. There are four main phases of waste management: waste collection, waste removal, waste reuse, and waste recovery. Poor and unmanaged waste management creates challenges for human health and the environment [35], making waste management essential for the economic development of smart urban areas.

THE PROCESSING OF DATA IN REAL-TIME
The problem of processing extensive data becomes increasingly difficult as data volume and diversity both rise. For efficient analytics, it is necessary to have access to the information within a bounded time frame. For instance, real-time data processing is essential in a traffic monitoring system that constantly tracks millions of cars; this processing helps in locating alternative routes and calculating arrival times. Timeliness is of the utmost significance in this context, since a mistake or delay might result in the misrouting of an ambulance, putting lives at risk. With more and more people needing access to decision-making tools in real time, timeliness has emerged as a crucial indicator of data quality. Therefore, the capacity to handle massive amounts of data in real time is vital. As a bonus, the timely nature of big data can aid in analyzing event streams to enable real-time decision-making. The diverse data sets provided by many data sources must therefore be integrated into a unified analytical platform to minimize potential delays in real-time processing [36]-[39]. The flowchart of real-time data processing is given in Figure 2. It begins with real-time data collection from various sources in the city. A framework is then selected to handle the collected data efficiently. Big data analysis is conducted to derive insights and identify patterns. Finally, optimized smart city services, such as smart transportation, energy management, waste management, and more, are implemented based on the analysis. This systematic approach leverages technology and data to improve urban living.
Real-time data processing is essential for maximizing the effectiveness of smart city services; however, effective scheduling becomes crucial to guarantee prompt delivery of services and efficient completion of tasks. Tasks are prioritized, and efficient timetables for various processes are created, using scheduling algorithms and policies. In complex real-time operations, scheduling is especially important for ensuring punctuality and meeting requirements. A schedule that meets most requirements for a particular set of processes is considered optimal in this context.

Scheduling
A scheduler prioritizes tasks using an algorithm or policy; its job is to create a timetable for a group of processes. A process set is feasible if it can be scheduled to meet specific requirements. Complex real-time periodic operations often need a guarantee of punctuality. An optimum schedule is one that meets most of the specified requirements for a given set of processes; in most cases, a scheduler is optimum if it can schedule every feasible collection of operations [40]. Scheduling algorithms can be categorized as static or dynamic [41].

Static scheduler
In static scheduling, a schedule is generated offline. All scheduling decisions, such as when to execute each operation or send each message, are contained in the program, and during runtime a simple dispatcher distributes jobs according to the schedule. Static scheduling is sometimes known as time-triggered scheduling [42]: all scheduling choices are stored in a table for use at runtime. This is only possible with prior information about how each process behaves, so the approach can only function if all operations are genuinely periodic. Although it demands insight into a process's characteristics beforehand, the overhead it imposes during execution is negligible. Shortest job first (SJF) and rate monotonic (RM) are appropriate algorithms for static process scheduling; in both, priority is allocated depending on the deadline and the time required to finish the task [43].
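The RM priority rule and its classic sufficient schedulability bound can be sketched in a few lines of Python; the task names and (period, worst-case execution time) values below are illustrative assumptions, not drawn from the cited works.

```python
# Sketch of static rate-monotonic (RM) scheduling: priorities are fixed
# offline, with shorter-period tasks receiving higher priority.

def rm_priorities(tasks):
    """Return task names from highest to lowest RM priority (shortest period first)."""
    return [name for name, (period, _) in sorted(tasks.items(), key=lambda kv: kv[1][0])]

def rm_utilization_test(tasks):
    """Sufficient schedulability test: total utilization U <= n * (2**(1/n) - 1)."""
    n = len(tasks)
    u = sum(wcet / period for period, wcet in tasks.values())
    return u <= n * (2 ** (1 / n) - 1)

# Hypothetical task set: name -> (period, worst-case execution time)
tasks = {"sensor": (10, 2), "control": (20, 4), "logger": (50, 5)}
order = rm_priorities(tasks)           # shortest period ("sensor") comes first
feasible = rm_utilization_test(tasks)  # U = 0.5, below the bound for n = 3
```

Because the priority assignment depends only on the fixed periods, the whole table can be computed offline, which is exactly what makes the runtime dispatcher trivial.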

Dynamic scheduler
A dynamic approach, in contrast, establishes schedules during execution, providing a more adaptable system capable of handling unanticipated occurrences. In safety-critical systems, however, all events should be predictable, and schedulability should be verified before any action, which calls for a scheduling method whose behavior is entirely predictable across time. Online schedulers make scheduling choices while the system is actively running; these choices are grounded in the past and present state of the process context, i.e., the current systemic condition. A scheduler that additionally knows all future requests is termed clairvoyant. Two commonly used dynamic schedulers in real-time systems are least slack time first (LST) and earliest deadline first (EDF); in these algorithms, priority is decided based on the slack times and deadlines of the given processes. Both are considered suitable for soft real-time operating systems [43]. The objectives of a few static and dynamic scheduling algorithms are summarized in Table 1.
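The two dynamic rules differ only in the key they minimize, as the following sketch shows; the job names, deadlines, and remaining execution times are invented for illustration.

```python
def edf_pick(ready, now):
    """Earliest deadline first: run the ready job whose deadline is nearest."""
    return min(ready, key=lambda j: j["deadline"])

def lst_pick(ready, now):
    """Least slack time first: slack = deadline - now - remaining work."""
    return min(ready, key=lambda j: j["deadline"] - now - j["remaining"])

jobs = [
    {"name": "A", "deadline": 12, "remaining": 3},  # slack at t=0: 9
    {"name": "B", "deadline": 10, "remaining": 6},  # slack at t=0: 4
]
# At t=0 both rules choose B: it has the nearest deadline and the least slack.
```

The two rules can disagree once remaining work differs sharply between jobs, which is why LST is sometimes preferred when execution-time estimates are reliable.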

Table 1. Objectives achieved by selected static and dynamic scheduling algorithms

Category | Algorithm | Objectives achieved
Static | Highest level first with estimated time [44] | Minimized running time; simplified the list scheduling algorithm
Static | Critical path on a processor [44] | Limited the cost of computation and time consumed
Static | Constrained earliest finish time [45] | Reduction in implementation time
Static | Multipriority queueing genetic [46] | Decreased the execution time for subtasks
Static | Parallelism-based earliest finish time [47] | Reduced the finish time
Dynamic | Dynamic level scheduling [48] | Decreased schedule time
Dynamic | Dynamic task scheduling [49] | Less complicated, and less time taken to finish the tasks
Dynamic | Dynamic load balancing using genetic algorithms [50] | Optimized load balancing and processor consumption along with high speed
Dynamic | New response time bounds for fixed priority [51] | Better response time
Dynamic | Load-based schedulability [52] | Scheduling based on priority

Hybrid scheduler
Schedulers may be either preemptive or non-preemptive. In most cases, preemption happens when a process with a higher priority becomes executable; as a result of preemption, a process might be put on hold without its consent. Non-preemptive schedulers do not temporarily suspend running tasks; however, they can manage concurrency for processes running inside a resource with mutually exclusive access [53].
It is also feasible to use a hybrid system. A scheduler can have a preemptive design while allowing processes to run briefly without interruption: it may define an immutable block of code that another process cannot preempt. For instance, a program may poll the system clock for the current time, use that to determine how much of a delay is required, and then implement that delay. If the process could be paused between reading the clock and performing the hold, it would be impossible to write such code correctly. Caution is essential when implementing code that uses delayed-preemption primitives: the resulting blocking must be limited and minor, often of the same order of magnitude as the overhead of context switching. A scheduler using this strategy can enable a rapid context switch; for example, if the switch takes up to 50 processor cycles to defer itself, the context to be moved is short, and only about ten additional cycles are needed to accommodate the modified context [54].
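The clock-then-delay idiom described above can also be made robust without a non-preemptible block by targeting an absolute wake time, so that a preemption between reading the clock and sleeping only shortens the remaining wait; this is a generic sketch of the idea, not code from [54].

```python
import time

def delay_until(wake_time):
    """Sleep until an absolute monotonic-clock instant. Because the target is
    absolute, a preemption after computing wake_time cannot stretch the delay:
    the remaining wait is recomputed from the current clock on each pass."""
    while True:
        remaining = wake_time - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(remaining)

start = time.monotonic()
delay_until(start + 0.05)   # wait 50 ms measured from `start`
elapsed = time.monotonic() - start
```

The relative-sleep version (`time.sleep(delay)` after reading the clock) is exactly the code that breaks under preemption, which is what motivates either delayed preemption or an absolute target.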

Previous works
Numerous studies have been conducted on task scheduling, exploring various algorithms and models. One notable study by Liu and Layland in 1973 [55] focused on the earliest deadline first (EDF) scheduling algorithm and fixed priority (FP) scheduling. They investigated these algorithms under the ordinary periodic job model, without self-suspensions, and demonstrated that EDF is an optimal approach to meeting deadlines. Additionally, they established the superiority of the rate monotonic (RM) scheduling algorithm among FP techniques.
Another line of work in [56], [57] centered on configuring and scheduling emergency resources during fire catastrophes, developing a dynamic model to analyze and address this critical aspect. Similar studies constructed emergency resource scheduling models, considering factors such as an arbitrary initial time for rescue operations and a fixed number of rescuers [57], [58]. In 2010, Sandholm and Lai [59] proposed a dynamic proportional share scheduler, an enhancement to Hadoop's schedulers that gives quality of service (QoS) to diverse users based on priority. This allows users to choose tasks and schedule them according to their preference, and the allocated resources can also be changed based on work requirements. When users express no special resource requirements, the scheduler behaves like a fair scheduler.
In 2016, Zacheilas and Kalogeraki [60] introduced a cost-effective scheduling technique that aims to meet financial constraints while also improving task completion time. The method implements a Pareto-based approach and helps decrease completion time while giving better throughput. One aspect that influences a cluster's overall performance is job response time; this inspired Zaharia et al. [61] to propose the longest approximate time to end (LATE) scheduling algorithm to improve response time. LATE runs a backup copy of a slow task on a separate node; various factors, including increased CPU usage and sluggish background tasks, can be behind a task's slow progress.
Locality-aware reduce task scheduling (LARTS) [62] aims to enhance data locality and thereby minimize network traffic. The study also addressed premature shuffle concerns: although an early shuffle improves performance and reduces turnaround time, it also burdens the network. LARTS therefore delays the shuffle until a certain amount of map processing is done; this starting point is called the sweet spot. In 2012, Guo et al. [63] proposed delay scheduling, which addresses a disadvantage of the fair scheduler by easing the difficulty of placing tasks near their data. When a request for a new task arrives, delay scheduling finds a job that meets the locality constraints and defers assignment if the conditions are not fulfilled.
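A simplified reading of the delay-scheduling rule, not the exact algorithm of [63], is that a job may decline a non-local scheduling opportunity a bounded number of times before locality is relaxed; the job/task structure below is an assumption for illustration.

```python
def delay_schedule(job_queue, node, max_skips):
    """Pick a task for `node`: prefer tasks whose data lives on `node`, letting
    a job skip up to max_skips opportunities before running a task remotely."""
    for job in job_queue:
        local = [t for t in job["pending"] if node in t["preferred_nodes"]]
        if local:
            job["skips"] = 0
            job["pending"].remove(local[0])
            return local[0]
        if job["skips"] >= max_skips:
            job["skips"] = 0
            return job["pending"].pop(0)  # locality constraint relaxed
        job["skips"] += 1  # decline this opportunity, wait for a local slot
    return None

queue = [{"pending": [{"id": 1, "preferred_nodes": ["n2"]}], "skips": 0}]
first = delay_schedule(queue, "n1", max_skips=1)   # job waits: returns None
second = delay_schedule(queue, "n1", max_skips=1)  # skips exhausted: runs remotely
```

The single `max_skips` knob captures the trade-off the text describes: waiting improves locality but delays assignment.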
Table 2 presents a comprehensive comparison of the discussed techniques with other approaches. The table provides a detailed evaluation of various factors, such as performance metrics, scalability, resource utilization, and adaptability. Among the compared approaches are LARTS [62] (data locality), a parental prioritization-based task scheduling algorithm [64] (fairness), a modified particle swarm optimization algorithm [63], and a hybrid of genetic and particle swarm optimization [65]. By comparing the discussed techniques with alternative methods, this analysis offers insights into the strengths and limitations of each approach, aiding researchers and practitioners in selecting the most suitable scheduling technique for their specific requirements.

Computable and decidable
The computational cost and complexity of scheduling for intricate systems are a genuine concern. Online scheduling methods should avoid scheduling algorithms with exponential complexity because of their severe influence on the time available to application software. Furthermore, some scheduling problems are computationally intractable, making them inappropriate even for offline scheduling. Therefore, computability and decidability must be considered as two aspects of computational complexity: the computability of a schedule determines whether a given schedule is feasible, while decidability helps to assess whether a feasible schedule exists at all [40].

DATA PROCESSING FRAMEWORKS
In big data analytics, efficient data processing frameworks serve as the backbone of handling and analyzing vast amounts of information, which includes the data generated by smart cities.These frameworks provide the necessary tools and infrastructure to extract valuable insights from diverse data sources, enabling cities to make data-driven decisions and optimize urban life.Hadoop and Spark are two prominent data processing frameworks that have revolutionized the field and found extensive applications in smart cities.

Spark framework
Spark is an open-source framework for processing large amounts of data quickly and easily. It debuted in 2009 at Berkeley and was later adopted by Apache. Iterative machine learning algorithms, interactive data analysis tools, and graph algorithms are all examples of applications that benefit from repeated passes over the same data [66]. The Spark framework [67] was developed to accommodate these programs while providing the scalability and fault tolerance of the MapReduce framework. Spark's two primary abstractions for parallel scheduling are resilient distributed datasets (RDDs) [68] and parallel operations on these datasets (applying a function across a dataset). RDDs, first introduced by Spark, are collections of read-only items partitioned across many computers that can quickly be rebuilt if a partition is lost. A user can cache an RDD in the machines' memory and run parallel operations, such as MapReduce-style computations, over it many times. As a result, Spark excels at processing iterative algorithms on datasets [69], [70].
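In real Spark code these steps would be `flatMap`, `map`, and `reduceByKey` calls on an RDD, executed in parallel across partitions; the plain-Python sketch below only mimics that dataflow on a toy input to show the two phases.

```python
from collections import defaultdict

# Toy input standing in for an RDD of text lines.
lines = ["smart city data", "big data city"]

# "map" phase: emit a (word, 1) pair for every word (Spark: flatMap + map).
pairs = [(word, 1) for line in lines for word in line.split()]

# "reduce" phase: sum the counts per key (Spark: reduceByKey(add)).
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
```

The point of the RDD abstraction is that the intermediate `pairs` collection can stay cached in cluster memory, so an iterative algorithm can rerun the reduce phase without re-reading the input from disk.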

Hadoop vs Spark
Aziz et al. [71] analyzed Twitter data using the Spark platform in 2018; exploring all the tweets on Spark took one second. This research centered on the authors' examination of the actual execution and completion of the standard Hadoop MapReduce framework, as well as the implementation of the Apache Spark framework. Experimental simulations were also run to process real-time data using Spark and Hadoop, and the paper discusses Hadoop's constraints and benefits for real-time processing. Finally, a simulation comparison of the two frameworks' speed shows that Spark is a powerful tool for processing real-time data streams.
In 2017, Hazarika et al. [72] evaluated the theoretical and practical differences between the Spark and Hadoop systems. In their studies, Spark's cache benefits repeated queries such as logistic regression, making them significantly quicker; on the other hand, Spark's performance is weak for nonrepetitive queries because of the limited cache size. Small iterations, however, benefit considerably from Hadoop's speed.
In 2015, Gopalani and Arora [73] examined the two large data processing frameworks, Hadoop and Spark. They used both to apply the K-means algorithm, a fundamental machine learning technique, to a dataset of sensor data and then compared the two platforms' execution times; the results showed that Spark performed better than Hadoop in real-world scenarios. Furthermore, Gu and Li [74] conducted another comparison of memory needs and processing times for Hadoop and Spark, implementing the PageRank algorithm on several network datasets. According to the findings, Spark used more memory while taking less time to execute; impressively, Spark was 73% faster than Hadoop when dealing with massive datasets.
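For intuition about the workload benchmarked above, a minimal one-dimensional K-means alternates assignment and centroid-update steps; this is a toy sketch with invented data, not the implementation used in [73].

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal K-means on 1-D data: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centers = kmeans_1d(data, 2)  # converges near [1.0, 10.0]
```

Each iteration rescans the full dataset, which is exactly why caching the data in memory (Spark) beats re-reading it from disk (Hadoop MapReduce) for this algorithm.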
In 2013, Zaharia et al. [75] used logistic regression to examine the Hadoop and Spark frameworks. The study focused on a class of applications that reuse a working set of data across multiple parallel operations, including many iterative machine-learning algorithms and interactive data analysis tools. Spark introduces the resilient distributed dataset (RDD) abstraction to serve these applications. Spark can beat Hadoop by a factor of ten in iterative machine learning tasks and can query a 39 GB dataset interactively with a response time of under one second. According to the findings of this article, Spark is the preferable option.
Liang et al. [76] compared Hadoop, Spark, and DataMPI in terms of execution speed, memory footprint, and central processing unit consumption in 2014. The authors used BigDataBench, a benchmark suite for large data sets, to conduct in-depth analyses of the resource-use characteristics and performance of Spark, DataMPI, and Hadoop. In these investigations, DataMPI delivered a 57% improvement over Spark and a 50% improvement over Hadoop in job execution time. DataMPI's main advantages were its efficient communication mechanisms and high throughput; in addition, it made better use of its resources (disk, CPU, network I/O, and memory) than the other two frameworks. As a result, the MPI platform outperformed both Spark and Hadoop, and Spark in turn surpassed Hadoop.
Mavridis and Karatza [77] assessed the performance of log file analysis using both Hadoop and Spark. They examined log file analysis on the cloud computing frameworks Apache Hadoop and Apache Spark, enhanced the log file analysis in both frameworks to handle real-world data from the Apache web server, and conducted further tests with varied parameters to evaluate and contrast the two frameworks. Im and Moseley [78] used MapReduce to examine conditional lower bounds on graph connectivity. This research identified problems that prevent efficient external-memory algorithms from being integrated into MapReduce and addressed a fundamental question: how to tell whether a graph is connected. In particular, they examined the problem of designing an algorithm to determine whether two disconnected components exist in a given network; the challenge is to verify the graph's global structure when all local parts of the graph look equivalent. They identified a natural class of algorithms that can only transfer, store, and process data along paths, proving that no randomized algorithm in this class can answer the question in a sublogarithmic number of rounds. Kodali et al. [79] worked on a k-NN-based method using MapReduce for meta-path categorization in heterogeneous information networks (HINs). The authors used the PathSim similarity measure in a heterogeneous information network to classify the meta-paths uncovered by applying the well-known MapReduce paradigm to the problem of locating k-nearest neighbors, and they devised a classification technique to deal with the massive data found in HINs using MapReduce.
Wang et al. [80] studied MapReduce task scheduling under excessive energy consumption in heterogeneous clusters and proposed a task scheduling framework for heterogeneous clusters that considers resource utilization, deadlines, and data locality to keep energy costs to a minimum. The framework includes updates to the slot list, new task lists, and scheduling, along with a novel job-sequencing proposal that creates a rational list of jobs and tasks based on factors like expected processing times, available job slots, and due dates. Wei et al. [81] introduced a MapReduce-centric clustering method for handling large datasets. Their study compared the MapReduce implementation of the Canopy method with the widely used K-means algorithm; by evaluating their performance and effectiveness in clustering large datasets, Wei et al. [81] shed light on the advantages and limitations of these approaches.
In a related study, Roger et al. [82] proposed a preemptive fair scheduler policy for the Disco MapReduce architecture. They explored how the policy impacted job execution times in both production and research settings: while it proved beneficial in reducing execution times for production jobs, it had a negative impact on research jobs. The authors provided insights into the trade-offs of implementing the preemptive fair scheduler policy.
Jang et al. [83] investigated k-nearest neighbor input initialization for neural network inversion. This study presents a fresh way of initializing the input variables of neural networks, centered on the k-nearest neighbor (k-NN) technique: the proposed method finds inputs within a training dataset that generate an outcome near a target output and combines them to form the starting input variables. Chen et al. [84] performed fast peak density clustering for large-scale data based on kNN. The proposed methodology, computed using a fast kNN algorithm such as a cover tree, significantly improves over the previous method of computing density; it uses kNN-density together with a quick procedure to differentiate between local and nonlocal density peaks.
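One plausible form of the kNN-density idea (the exact formula in [84] may differ; this is an illustrative sketch) scores each point by the inverse of the total distance to its k nearest neighbours.

```python
def knn_density(points, k):
    """kNN-density sketch for 1-D data: a point in a dense region has small
    distances to its k nearest neighbours, hence a large density score."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        densities.append(k / sum(dists[:k]))
    return densities

pts = [0.0, 0.1, 0.2, 5.0]
scores = knn_density(pts, 2)
# The outlier 5.0 receives the lowest density score.
```

A fast kNN structure such as a cover tree replaces the brute-force sorted-distance scan here, which is what makes the approach viable at large scale.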
Janardhan and Samuel [85] investigated optimal parallelism in the Spark architecture on Hadoop yet another resource negotiator (YARN) to make the most of the cluster's resources. Their research suggests the best parallelism configuration for an Apache Spark deployment on a Hadoop YARN cluster, based on findings that examine how Spark application performance depends on parallelism at each level. A zone-based resource allocation technique called Zebras, which enhances Spark's efficiency in a heterogeneous cluster, has also been proposed; by optimizing resource utilization and allocation within the Spark cluster, this technique ultimately improves overall performance.
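As a rough illustration of this kind of parallelism tuning, the sketch below encodes a widely used rule of thumb for sizing Spark executors on YARN (reserve one core and some memory per node for the OS and Hadoop daemons, cap executors at about five cores, reserve one slot for the ApplicationMaster, and set default parallelism to a small multiple of the total executor cores). This heuristic is a community convention, not the specific configuration derived in [85].

```python
def suggest_parallelism(nodes, cores_per_node, mem_per_node_gb,
                        cores_per_executor=5):
    """Illustrative Spark-on-YARN sizing heuristic (rule of thumb)."""
    usable_cores = (cores_per_node - 1) * nodes      # leave 1 core/node for OS
    executors = usable_cores // cores_per_executor - 1  # minus the YARN AM
    executors_per_node = max(1, (cores_per_node - 1) // cores_per_executor)
    mem_per_executor = int((mem_per_node_gb - 1) / executors_per_node)
    # Common guidance: 2-3 tasks per executor core.
    default_parallelism = executors * cores_per_executor * 2
    return {"num_executors": executors,
            "executor_cores": cores_per_executor,
            "executor_memory_gb": mem_per_executor,
            "spark.default.parallelism": default_parallelism}

print(suggest_parallelism(nodes=10, cores_per_node=16, mem_per_node_gb=64))
# → {'num_executors': 29, 'executor_cores': 5,
#    'executor_memory_gb': 21, 'spark.default.parallelism': 290}
```

The resulting values would be passed to `spark-submit` via `--num-executors`, `--executor-cores`, `--executor-memory`, and `--conf spark.default.parallelism=...`.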
Hussain and Surendran [86] explored efficient content-based, fast-response image retrieval using the MapReduce and Spark frameworks. The authors leverage the MapReduce model to efficiently index massive volumes of images, enabling fast retrieval based on content. Furthermore, in 2021, Mostafaeipour et al. [87] adopted Spark as a parallel method for recovering the index, operating on top of the MapReduce framework and utilizing the Hadoop distributed file system (HDFS). Their work focuses on efficient index recovery using Spark's capabilities within the MapReduce ecosystem.
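The indexing step in such retrieval systems is typically an inverted index built in map/reduce style: the map phase emits (feature, image-id) pairs from each image's content signature, and the reduce phase groups image ids under each feature. The sketch below shows the general pattern only; the feature names and data are invented for illustration and do not come from [86].

```python
from collections import defaultdict

def map_phase(image_id, features):
    """Map: emit (feature, image_id) for every feature in the image's
    content signature (e.g. quantized visual words)."""
    return [(f, image_id) for f in features]

def reduce_phase(pairs):
    """Reduce: group image ids under each feature -> inverted index."""
    index = defaultdict(set)
    for feature, image_id in pairs:
        index[feature].add(image_id)
    return index

signatures = {"img1": ["sky", "water"], "img2": ["sky", "sand"]}
pairs = [p for img, feats in signatures.items()
         for p in map_phase(img, feats)]
index = reduce_phase(pairs)
print(sorted(index["sky"]))  # → ['img1', 'img2']
```

At query time, a probe image's features are looked up in the index and candidate images are ranked by how many features they share, which is what makes the retrieval "fast-response" at scale.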
In addition to the insights provided, Table 3 further reinforces the key differences between Spark and Hadoop. The table highlights specific research gaps and indicates whether each gap is present in Spark or Hadoop. This comprehensive comparison aids in understanding the unique strengths and limitations of each framework, enabling researchers and practitioners to make informed decisions regarding their data processing needs in the context of smart cities.

CONCLUSION
As the world moves toward the era of smart cities, it becomes highly dependent on IoT and web applications. Smart cities are gaining popularity as they positively impact a country's economy. Intelligent and rapid decision-making are critical requisites of a sophisticated smart city system. At the same time, such a system generates massive volumes of data, known as big data, characterized by the 3 V's, which poses a significant problem. New ideologies, strategies, and frameworks must be introduced to constructively address the issue of handling and scheduling big data. This article provides a thorough review of the work done on scheduling techniques in the Hadoop and Spark environments. Dynamic scheduling is crucial to achieving high performance in large-scale data processing. Data volume, diversity, velocity, security and privacy, cost, connectivity, and data sharing are just a few of the difficulties with big data. From the conducted review, it can be concluded that the baseline is adequate for processing when the data are static and it is possible to wait until batch processing finishes. However, Spark has an advantage in real-time, parallel data processing. Extensive research is still needed before concluding that Spark is the only solution for analyzing real-time streaming data.
Additionally, as demonstrated in the study, Spark can evaluate data quickly. Spark is a top-notch in-memory processing technology that enables real-time streaming processing of massive amounts of data. Compared to Hadoop, Apache Spark is far more sophisticated; it supports several needs, including batch, streaming, and real-time processing. In the future, schedule optimization can be pursued for Hadoop. For Spark, it can be done by modifying various default parameter configuration settings, introducing new scheduling techniques, and applying hybrid artificial intelligence scheduling.

Figure 2. Real-time data processing flow chart

Table 2. Scheduler techniques comparison table

Table 3. Data processing aspects comparison table