Systematic literature review: application of deep learning processing technique for fig fruit detection and counting

ABSTRACT


INTRODUCTION
In the growing field of image-based object detection [1], computer vision and pattern recognition are the principal tools. Object detection methods have numerous applications, including the detection of vegetables and fruits [2], [3]. The recent rapid progress in both AI algorithms and image sensor technology has led to the rise of automated fruit detection systems [4], [5]. Fruit detection in orchards has traditionally relied largely on manual visual inspection, which is both time-consuming and labor-intensive [6]-[10]. In conventional orchards, agricultural data are recorded by human perception; however, there can be large discrepancies in the recorded data because farmers have widely varying levels of competence [11]. Hence, for developing autonomous harvesting, targeted medicine applications, and many other intelligent farm machinery technologies, an effective automatic detection approach for the agriculture sector is essential [12]-[14]. Deep learning approaches based on machine vision can extract hidden patterns from agricultural datasets to build a prediction framework that can help agriculturists with diagnosis [15]. The fruits' size, colour, and form are used to train neural networks [16]. In machine learning and deep learning, computer vision algorithms have enhanced the efficiency of image recognition and detection tasks [17]. The findings in [18] indicate that machine learning can enhance the performance of a picking robot's detection technique. Deep learning's progress has brought computer vision into agriculture for target detection and image semantic segmentation, giving the best results [19]. Because of its ability to handle large amounts of data, deep learning has proven to be a very effective tool [20]. Networks with hidden layers have overtaken traditional techniques in popularity, particularly in object recognition, classification, and detection [21]. One of the most popular types of deep neural networks is the convolutional neural network (CNN) [22]-[24].
Due to the fast growth of CNNs in recent years, their detection accuracy and speed are frequently superior to those of traditional object detection methods [25]. CNNs in their many versions have been used to detect a wide variety of fruits [26]-[28]. A CNN is built as a pyramid of convolution and pooling layers, which reduce the width and height of the image while increasing its depth [29]. The classifier is positioned atop this pyramid, where it connects the many nodes that make up the neural network.
This paper critically examines the detection and counting of fig fruits using deep learning by conducting a systematic literature review (SLR). The literature indicates that deep learning outperforms conventional fruit identification, recognition, and counting techniques. In addition, deep learning in agriculture is both time- and cost-efficient. In this research, we conducted the SLR following the standard review methodology called "Reporting Standards for Systematic Evidence Syntheses" (ROSES). The SLR method used in this study is described from the formulation of the research question through to the gathering and analysis of data.

METHOD
2.1. The review protocol-ROSES
The SLR in the current study follows the ROSES review protocol [30]-[33]. ROSES review protocols, illustrated in Figure 1, are specifically created for SLRs. Haddaway et al. [34] stated that ROSES also includes a comprehensive set of reporting requirements for the conservation and environmental management research synthesis community. According to ROSES, conducting an SLR involves four main steps: formulation of the topic of interest, systematic searching strategies, quality appraisal, and data abstraction and analysis. The systematic searching strategies comprise three sub-processes: identification, screening, and eligibility of the obtained articles. Only high-quality articles related to the main research question are selected and reviewed throughout the SLR procedure.

Formulation of research question
The formulation of the research question [35] is based on the population, interest, and context (PICo) tool, which is defined by three main elements: population, interest, and context [36]. This tool is used [37] to develop research questions suitable for the review. Based on this tool, the population of this research is deep learning, the interest is detection and counting, and the context is fig fruit in the wild. These three aspects were then used as a guideline in generating the research questions:
RQ1: Which deep learning model approaches were used to detect and count fruit in the wild?
RQ2: What is the dataset preparation process used?
RQ3: What is the overall performance of each deep learning model?

Systematic searching strategies
This section discusses the three sub-processes that make up the second step of the SLR approach: identification, screening, and eligibility, which are used to locate relevant articles for this review. Through these sub-processes, a large number of articles is narrowed down to a small set, from which only the most appropriate papers are reviewed.

Identification
The identification step involves finding synonyms and related terms for the primary keywords of the investigation. The objective is to broaden the search in the chosen databases so that more related articles can be found for the review. The keywords were selected in accordance with the study's objective and problems [38], and articles were retrieved from three main indexed databases: IEEE, Scopus, and Web of Science (WOS). The keywords used to discover related articles were derived from the research questions. Furthermore, to avoid retrieving too few papers, the study widened the scope by including fruits and vegetables. Table 1 displays the full search strings built by the researchers using the Boolean operators "AND" and "OR", phrase searching, truncation, and wild cards in the three databases. This approach retrieved 1032 IEEE articles, 417 Scopus articles, and 206 WOS articles.

Screening
The 1655 papers identified were filtered using the article selection criteria. The filtering was carried out automatically using the databases' sorting function. The research questions formulated in the preceding step served as the basis for the selection criteria [39], [40]. Only publications from 2017 to 2021 that were written in English were chosen from IEEE, Scopus, and WOS. Furthermore, this study reviews only published journal and conference papers to ensure that acceptable scientific papers relating to our study are included. Other kinds of publications, such as books and review articles, were excluded. During this process, 1249 papers were excluded because they did not meet the criteria. Table 2 shows the inclusion and exclusion criteria based on timeline, language, and type of source.

Eligibility
The third step is the eligibility process. In this step, all of the articles remaining after screening were checked manually to make sure that they met the criteria [41]. The check was carried out by reading the titles and abstracts of the papers. During this procedure, 373 articles were eliminated because their principal goal was unclear and they did not have a major impact on this review. In addition, for papers duplicated across different databases, one copy was removed to ensure that the study does not refer to the same paper twice. The remaining 33 papers were carried forward to the quality appraisal step. As the flow diagram in Figure 2 shows, the research method narrowed the focus of this study from the 1655 papers identified to only 33 papers to be reviewed.
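The counts reported across the identification, screening, and eligibility steps can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `remaining_after` is hypothetical, and the figures are the ones quoted in the text.

```python
# Records retrieved per database, as reported in the identification step.
identified = {"IEEE": 1032, "Scopus": 417, "Web of Science": 206}


def remaining_after(total, *excluded):
    """Subtract each stage's exclusions in turn and return the running totals."""
    totals = [total]
    for n in excluded:
        totals.append(totals[-1] - n)
    return totals


total = sum(identified.values())
# 1249 records were excluded at screening and 373 at eligibility.
stages = remaining_after(total, 1249, 373)
print(stages)  # [1655, 406, 33]
```

The final value matches the 33 papers carried forward to quality appraisal.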

Quality appraisal
The articles were assessed for quality so that only high-quality content was chosen [42]. The papers were categorized into three quality levels during this process: high, moderate, and low. To determine the quality rating, the high- and moderate-quality articles were examined based on their methodology and results. The research was evaluated using the following quality criteria:
QA1: Is the study related to the research objectives?
QA2: Is the deep learning model technique mentioned in the study?
QA3: Is there a description of the data preparation method?
QA4: Is the research methodology stated in detail?
QA5: Has the effectiveness of the proposed methodology been analysed?

Data abstraction and analysis
The researchers screened in depth the abstract, methodology, results, and discussion sections of all 33 papers [42]. The research questions guided the data abstraction: any data from the studies that could help answer the research questions were abstracted and placed in a table.

RESULTS AND DISCUSSION
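A total of 33 papers were reviewed based on the research method. Several aspects were developed based on the systematic review, including publication year, deep learning model technique, dataset preparation, and deep learning model performance evaluation. The results presented are based on a review of previously published research on these topics.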

Selected primary studies
Using IEEE, Scopus, and WOS, 33 papers were selected as the primary studies for review based on the research method. All the selected papers discuss a deep learning model approach. The identification (ID), publication title, authors, and publication year of each selected primary study are presented in Table 3.

Publication years
The publication years covered by this study were set between 2017 and 2021, that is, the last 5 years. Since deep learning is a recent research area, especially in the agriculture sector, it is best to review recent studies. Figure 3 depicts the number of studies conducted in agriculture over the last five years, from 2017 to 2021, that use deep learning for object detection. Most of the studies, a total of 16, were published in 2020. In 2017, only 2 studies applied deep learning in the agricultural sector, which increased to 3 studies in 2018. The number of publications continued to rise, with seven studies released in 2019. When compared to 2019 and 2020, there was a considerable drop in 2021. This decrease could be attributed to the effects of the Covid-19 pandemic, which resulted in fewer conferences due to sanitary regulations. However, since this study was conducted in 2021, it is to be expected that only 6 studies from 2021 were found, since others might not yet have been available for analysis. Overall, the trend shows that the study of deep learning in agriculture is becoming more popular every year.

Quality appraisal result
Based on the quality assessment (QA) questions explained in section 2.4, the 33 studies were analyzed. To measure the quality of the articles, the researchers modified the scoring technique used by Alsolai et al. [53].
The quality evaluation scoring system was as follows: i) 1 point = Yes, ii) 0.5 points = Partially, and iii) 0 points = No. The scores separated the articles into three categories: i) zero to two points were regarded as weak, ii) more than two and up to three points were considered moderate, and iii) more than three and up to five points were rated strong. Table 4 summarises the QA results. Twenty-one studies were rated as strong, five of which achieved a full score that meets all the criteria. The other 12 studies were rated as moderate, with most scoring three points and only one study scoring 2.5 points.
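The scoring scheme can be illustrated with a minimal sketch. The names `qa_score` and `qa_rating` are hypothetical, and the band boundaries are an assumption chosen so that scores of 2.5 and 3 fall in the moderate band, in line with the reported results; they are not taken verbatim from the reviewed protocol.

```python
# Points awarded for each answer to QA1-QA5.
POINTS = {"yes": 1.0, "partially": 0.5, "no": 0.0}


def qa_score(answers):
    """Sum the points for one study's answers to the five QA questions."""
    return sum(POINTS[a.lower()] for a in answers)


def qa_rating(score):
    """Map a total score (0 to 5) to a quality band.

    Assumed bands: weak <= 2 < moderate <= 3 < strong <= 5.
    """
    if score <= 2:
        return "weak"
    if score <= 3:
        return "moderate"
    return "strong"


# Hypothetical study: three "Yes", one "Partially", one "No".
example = ["Yes", "Yes", "Partially", "No", "Yes"]
print(qa_score(example), qa_rating(qa_score(example)))  # 3.5 strong
```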

Deep learning approach
There are two main types of deep learning detection techniques used for image detection systems. The first type is the detection method based on region generation, also known as two-stage target detection, in which an algorithm first generates a series of candidate frames and then classifies the targets in those frames. The region convolution neural network (RCNN), the mask RCNN, the fast RCNN, and the faster RCNN are all good examples of this type of network. Though effective, these techniques are too time-consuming and cumbersome to be used in real-time detection settings [50].
The second type is regression-based methods, which perform target localization and target category prediction simultaneously (hence the name "one-stage target detection"). Among the many types of networks available, single shot detection (SSD) and the you only look once (YOLO) series stand out as particularly effective. Santos et al. [48] mentioned that this family of methods has a quick recognition speed and can meet real-time requirements.
Table 5 summarises the results of a meta-analysis that found 10 studies had used a single-stage object detection approach. One study employed YOLOv2, two employed YOLOv3, two employed YOLOv4, and one employed SSD. Additionally, three studies modified the YOLO architecture to create their own YOLO deep learning models with improved performance: S8 and S31 modified the YOLOv3 model, while S12 created its own model, named MANGO-YOLO, based on the architecture of the YOLO network. Other studies utilised two-stage target detection; 12 of them employed mask RCNN, 9 used faster RCNN, and 3 employed a modified deep learning model based on the current faster RCNN model in terms of the backbone. One study used a modified mask RCNN. Table 5 clearly shows that the majority of the studies used two-stage target detection, particularly mask RCNN, as deep learning models. Figures 4 and 5 display the architecture of one-stage and two-stage detection, respectively. The two-stage detector can be split into two stages at the region of interest (RoI) pooling layer [52]. The region proposal network (RPN) is the initial stage, which predicts possible bounding boxes [54]. In the second stage, features are pooled from each candidate box using the RoI pooling technique for the subsequent classification and bounding box regression tasks. In contrast, a one-stage detector makes bounding box predictions in a single step, without the need for region proposals. It uses a grid and anchor boxes to constrain the shape of the object and pinpoint where it is in the image [44].

Dataset preparation
Preparing a dataset is an essential step in object detection, recognition and counting. Dataset preparation involves several processes, such as data collection, resizing, auto-orientation, annotation, data augmentation and data splitting. Annotation is a critical step in object detection: every study on object detection, recognition, and counting must go through an annotation process [46]. Table 6 shows that all the studies from S1 to S33 performed annotation during data preparation. Annotation is done by professional human annotators using specified labels. In simple terms, image annotation entails adding metadata to a dataset that allows machines to recognize certain items in the image. The annotated dataset must therefore be verified by an expert. Besides annotation, resizing is also an important process in object detection. Every deep learning algorithm has a standard input image size [7]. Moreover, Liu et al. [26] found that resizing the image to a smaller size can reduce the time the model needs to learn the image. However, if the images are resized too small, such as 76 x 76 pixels or 144 x 144 pixels, the input image will not be sharp enough for the network to recognize and learn. Resizing in this way has two key drawbacks [55].
First, fine-grained visual characteristics essential for detecting small objects, such as balls, may be lost due to image subsampling. Second, resizing to a square format elongates the items in the image, which changes the characteristic shape of the fruit and further distorts its appearance. The common input image sizes for object detection, recognition, and counting are 416 x 416 pixels and 512 x 512 pixels [8], [10]. Images larger than these standard sizes have a higher resolution, which increases the time the model needs to learn and recognize the input image [27]. The higher the image's resolution, such as 1080 x 1080 pixels, the sharper the image and the easier it is for the model to recognize even small objects, but the longer the training time. Hence, following the standard sizes of 416 x 416 pixels and 512 x 512 pixels that other researchers have recommended is the best option.
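The shape distortion caused by resizing to a square input can be quantified with simple arithmetic. The sketch below is illustrative only: the function names and the 1920 x 1080 example image are assumptions, and the aspect-preserving alternative with padding (often called letterboxing) is a commonly used technique that the reviewed studies are not claimed to have applied.

```python
def squash_distortion(width, height, target=416):
    """Ratio of horizontal to vertical scaling when forcing target x target.

    A value of 1.0 means the object shapes are unchanged.
    """
    scale_x = target / width
    scale_y = target / height
    return scale_x / scale_y


def fit_preserving_aspect(width, height, target=416):
    """Scale so the longer side equals `target` and report the padding needed.

    Returns (new_width, new_height, pad_x, pad_y); the padding fills the
    remainder of the target x target square.
    """
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    return new_w, new_h, target - new_w, target - new_h


# Hypothetical 1920 x 1080 orchard photo resized for a 416 x 416 network input.
print(squash_distortion(1920, 1080))      # about 0.56: objects look narrower
print(fit_preserving_aspect(1920, 1080))  # (416, 234, 0, 182)
```

Under these assumptions, squashing compresses the horizontal axis to roughly 56% of the vertical scale, whereas the aspect-preserving variant keeps the fruit shape at the cost of 182 rows of padding.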
Based on Table 6, the datasets used in object detection, recognition, and counting studies can be divided into two types. The first type is public datasets, which have been published by other researchers who have given permission for others to use them. According to Table 6, 14 studies used public datasets. Typical public datasets include MS COCO, which contains about 330 thousand images and 80 object categories [13], [56]; ImageNet, a database of over a million images and one thousand object categories [43]; and datasets on Kaggle [57], which are collected manually by researchers and published on the platform. Roboflow is another source of public datasets that new researchers or other users can access. It provides access to public datasets, lets users submit their own custom data, and supports several different annotation and export formats [58]. All the data in these public datasets have been resized and are ready to use.
The second type is the personal dataset, which is collected manually at the orchard. In our survey, 19 studies collected their datasets manually. Researchers used their own datasets for several reasons, such as the limited availability of datasets suited to their study. Furthermore, a personal dataset can provide a selection of images that matches the researcher's requirements, which public datasets cannot always satisfy. For example, a researcher may need images of fruit under different weather conditions, locations, and backgrounds [11].
Twelve studies applied data augmentation, including 7 studies on personal datasets and 5 on public datasets. Data augmentation is a technique for addressing overfitting in the training stage of CNNs [59]. Overfitting occurs when the model fits random noise or errors instead of the underlying relationship. The researchers used several data augmentation techniques, including brightness adjustment, blur, cropping, rotation, flip, zoom, and noise disturbance [27]. Bargoti and Underwood [43] found that, after more image data are added, the model is less able to learn irrelevant patterns during the training phase, which helps it avoid overfitting and improves its performance. Finally, Xu et al. [60] mentioned that splitting the dataset correctly into training, testing, and validation sets is also important for better performance. The whole dataset is separated into two sets, the training set and the testing set, with the images selected at random. In neural network applications, the most common training/testing split ratio is 80/20 [61], although other similar ratios, such as 70/30, should not significantly affect the performance of the resulting models [50].
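A random 80/20 training/testing split of the kind described above can be sketched with the Python standard library. The function name `split_dataset`, the fixed seed, and the image file names are hypothetical and serve only as an illustration.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items reproducibly and split them into training and testing sets."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 image file names.
images = [f"img_{i:03d}.jpg" for i in range(100)]
train, test = split_dataset(images)
print(len(train), len(test))  # 80 20
```

Passing `train_fraction=0.7` gives the 70/30 variant mentioned in the text; a validation set could be carved out of the training portion in the same way.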

Deep learning model performance
This section discusses the effectiveness of the deep learning models. Researchers in the 33 studies used 10 different deep learning models. The effectiveness of a deep learning model must be assessed because, by evaluating its performance, deficiencies in the model's ability to detect objects can be identified and improved to achieve the study objective. The reviewed studies reported that they achieved their objectives. The higher a model scores on each metric, the better the deep learning model.
It is essential to evaluate the efficacy of a deep learning model by looking at its accuracy, precision, recall, F1-score, and average precision (AP). Table 7 summarises these metrics, including accuracy, precision, and recall (sensitivity). Four main concepts are used in computing the performance metrics: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) [14]. A prediction is true when the predicted label matches the ground-truth label, and false when it does not [62]. The predicted label itself is either negative or positive.
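The relationship between the four counts and the metrics named above can be sketched as follows, using the standard textbook definitions. The function name `classification_metrics` and the example counts are hypothetical and are not taken from any reviewed study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual positives
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 80 fruits found, 20 false alarms, 10 fruits missed.
metrics = classification_metrics(tp=80, tn=10, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In object detection, true negatives are often not counted because the background contains an unbounded number of "non-objects", which is one reason detection studies tend to report precision, recall, and F1-score rather than accuracy.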
Overall, if the prediction is wrong, the first word of the term is "false"; otherwise, it is "true". True positives and true negatives should be maximised, while false positives and false negatives should be kept to a minimum. For detection purposes, the prediction also needs to account for where in the image an instance of the class is [55]. The amount of overlap between the predicted bounding box and the ground truth object is used to evaluate the accuracy of the detection [55]. For a detection to be considered accurate, there should be more than a 50% overlap between the predicted bounding box (Bp) and the ground truth bounding box (Bgt) [63]. Based on Table 8, 17 studies evaluated their deep learning models using the F1-score, 12 evaluated model performance through AP, and 9 evaluated the model's accuracy. In contrast, the study in paper S10 did not detail the performance of its deep learning models; the researchers evaluated the performance based on the mean relative error (MRE). Next, most of the researchers' deep learning approaches achieved more than 80% accuracy, precision, recall, F1-score and AP. However, one paper (S19) has accuracy and recall below 80%, but this is still acceptable since its F1-score was 80%. This is because Bresilla et al. [64] stated that precision and recall are so closely linked that the F1-score alone can be used, as it considers both precision and recall when calculating how well the prediction matches the ground truth.
For a comprehensive evaluation of a model's efficacy, researchers use the F1-score, which gives equal importance to precision and recall [26]. The F1-score is calculated as the harmonic mean of the model's precision and recall. Accuracy measures how closely the predictions match the actual outcomes [22]. Accuracy, however, is useful only when the dataset or sample is well balanced. Precision [60] is defined as the ratio of true positive samples to all predicted positive samples. Recall [25] measures how well a model is able to predict real positive samples relative to the total number of actual positive samples. In general, recall decreases as precision improves [49]. Precision and recall levels can be shown by plotting a P-R curve [23]. To weigh precision and recall together in a model's overall evaluation, Yijing et al. [50] proposed using AP as a comprehensive evaluation indicator. AP is the area under the P-R curve, and a larger value indicates better model performance [52].
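The bounding box overlap criterion and the area under the P-R curve can both be sketched in a few lines. The overlap is computed here as intersection over union (IoU), a common formalisation of the 50% overlap rule that is assumed rather than quoted from the reviewed studies. The AP function uses plain trapezoidal integration, which is a simplification; benchmark implementations usually interpolate precision first. All function names and example values are hypothetical.

```python
def iou(box_p, box_gt):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    inter_h = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(box_p, box_gt, threshold=0.5):
    """A detection counts as correct when its overlap exceeds the threshold."""
    return iou(box_p, box_gt) > threshold


def average_precision(recalls, precisions):
    """Approximate the area under the P-R curve with the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))  # order the points by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical predicted and ground-truth boxes sharing half their width.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))               # one third: rejected at 0.5
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0]))  # 0.75
```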

CONCLUSION
An SLR was performed on recent developments of AI in agriculture, with a special focus on the use of deep learning algorithms for detecting and counting fig fruits. However, due to the scarcity of articles on fig fruits, the review expanded its scope to include fruits and vegetables. This systematic review is meant to add to existing knowledge by giving an overview of the available deep learning models used to identify, recognise, and count fruits, as well as the procedures for preparing datasets and the metrics for evaluating deep learning model performance. It identifies the best deep learning models for detecting and counting fruit in the wild, along with their advantages and disadvantages, the goals of the various dataset preparation processes, the most effective performance metrics for evaluating a model, and the research gaps that need to be explored. It would be simpler to identify all relevant variants of the search terms, so as to cover practically all associated information and gain a sufficiently comprehensive understanding of the problem, from review articles rather than from books.
After examining 33 publications published between 2017 and 2021, we found that most studies used the faster RCNN and mask RCNN two-stage object detection methods. However, when the performance of the deep learning models was compared, one-stage object detection with YOLO outperformed two-stage object detection. Finally, the majority of publications advocate data augmentation to combat overfitting and increase performance, and resizing to decrease training time.

Figure 2. The flow diagram of research methodology


ISSN: 2302-9285 Bulletin of Electr Eng & Inf, Vol. 12, No. 2, April 2023: 1078-1091
Systematic literature review: application of deep learning … (Ahmad Shukri Firdhaus Kamaruzaman)
preparation, and deep learning model performance evaluation. The results presented were based on a review of previously published research on these topics.
depicts the number of studies conducted in agriculture over the last five years, from 2017 to 2021, including the use of deep learning for object detection. Most of the studies were published in 2020, a total of 16. In 2017, only 2 studies were found to do deep learning in the agricultural sector, which increased to 3 studies in 2018.

Figure 3. Number of selected studies in the past five years
research have utilised two-stage target detection; 12 of them have employed mask RCNN, 9 have used Faster RCNN, and 3 have employed the modified deep learning model based on the current Faster RCNN model in terms of the backbone. One study used a modified mask RCNN. Table

Figure 4. One stage target detection
Figure 5. Two stages target detection

Table 1. Advanced search string

Table 2. The criteria for inclusion and exclusion

Table 3. Summary of selected primary study

Table 5. Deep learning model approach

Table 6. Dataset preparation analysis

Table 7. Deep learning performance algorithm
). Next, most of the researchers' deep performance approaches achieve more than 80% accuracy, precision, recall, F1-score and AP. However, one paper (S19) has accuracy and recall below 80%, but it's still acceptable since their F1-score was 80%. It is because Bresilla et al.

Table 8. Result based on performance metrics