Assessing the performance of YOLOv5, YOLOv6, and YOLOv7 in road defect detection and classification: a comparative study

Road defect inspection is a crucial task in maintaining good transportation infrastructure, as road surface distress can impact users' comfort, reduce the lifetime of vehicles' parts


INTRODUCTION
Roads are a vital means of transportation in many parts of the world. Various materials are used to construct road pavements, including porous asphalt, stone mastic asphalt and gap-graded asphalt, among others. Asphalt is prone to deterioration due to various factors, such as exposure to water and ambient temperature, excessive traffic loads, execution mistakes, and lack of maintenance [1]. Road defects fall into four types: pavement cracks, surface deformations, disintegrations, and surface defects. The size and shape of a road defect can be used to classify it further. Defects can also be graded into three severity levels: mild, moderate, and high [2]. Knowledge of the different types of road defects leads to a better understanding of their probable causes and treatments [3]. As road pavement is meant to provide a smooth and comfortable ride as well as surface resistance for safety, any deterioration of its surface must be detected at an early stage for rapid treatment. Road distress identification is also essential for determining the type of maintenance planning needed. There are three categories of detection techniques for road distress in Malaysia: manual, semi-automatic, and automatic [4]. In recent years, machine learning and machine vision have been adopted in various industry sectors and fields of study because of their benefits in productivity, efficiency, and flexibility. Despite these benefits, some challenges remain in applying machine vision to road defect detection, such as hairline cracks that are difficult to detect, limitations in detecting crack edges, and a lack of crack quantification data for further road maintenance purposes. Recent research in transportation engineering has already explored the application of machine learning to detecting road pavement deterioration. Convolutional neural networks (CNN), artificial neural networks (ANN), K-means clustering and regression are some of the most widely used methods thanks to their excellent performance [5].
ISSN: 2302-9285  Assessing the performance of YOLOv5, YOLOv6, and YOLOv7 in road … (Najiha 'Izzaty Mohd Yusof)
The main purpose of object detection in road distress inspection is to detect road defects in images taken of the inspected roads and to classify them correctly by type. Many promising object detection algorithms are readily available. The most commonly used approaches are you only look once (YOLO), single-shot detector (SSD) and CNN [6]. A CNN is a deep learning algorithm that aids parameter identification by separating an image into layers, so that each layer is examined and may be interpreted more precisely than with standard analysis approaches [7]. Typically, a CNN is constructed from input, convolutional, pooling, fully connected, and output layers. In [8], a network with three convolutional layers, two fully connected layers, and two neurons at the output layer was used, since only two output classes, crack and non-crack, were needed. This CNN was tested on two datasets: on the CrackTree200 dataset it reached an accuracy of 96.99%, while on a self-collected dataset it reached its highest accuracy of 98.8%. Ma et al. [9] tested YOLOv3, YOLOv4s-mish, and YOLOv5s models on cracks in timber structures, where YOLOv3 showed the best precision with a mean average precision (mAP) of 95.5%, while YOLOv5s, with a mAP of 92.9%, had the fastest training speed because it has the simplest network structure. Meanwhile, Yan and Zhang [10] proposed an improved SSD network that adds deformable convolution to the backbone feature extraction for detecting cracks in asphalt highway pavement, resulting in a mAP of 85.11%, which is 3.1% higher than the original SSD network.
Horvat et al. [11] used all of the YOLOv5 models to detect face masks in images; the YOLOv5x model had the longest training time, 8.67 hours, while achieving the best performance with a mAP of 77.1%. In another YOLOv5-based study, Yu [12] applied a threshold segmentation method based on Otsu's maximum inter-class variance to the dataset before training a YOLOv5-s model. The improved detector, which also adopted a K-means method, achieved 84.37% precision. Next, Aburaed et al. [13] evaluated YOLOv6 against YOLOv5 for crater detection; the claim that YOLOv6 outperforms YOLOv5 could not be confirmed, as their relative performance was inconsistent across scenarios. Meanwhile, Yang et al. [14] proposed a three-stage crack location and segmentation method in which the image is first filtered by the Retinex method to remove redundant noise, then passed to a detection stage where YOLO-SAMT is introduced, and lastly processed by K-means clustering to extract the cracks. YOLO-SAMT is an enhanced algorithm in which the YOLOv7 architecture is integrated with SimAM and a transformer, and it shows a 5.42% higher mAP than the original YOLOv7. Finally, road damage detection and classification on Google Street View data using YOLOv7 with a label smoothing technique resulted in a higher F1 score of 81.7% [15].
The detection and classification of road defects using object detection algorithms such as YOLOv5, YOLOv6, and YOLOv7 face several challenges. Limited availability of high-quality training data, variations in lighting, weather conditions, and road surfaces, and the difficulty of accurately distinguishing between different types of road defects are some of the critical issues to consider. In this context, the objectives of our paper are to evaluate and compare the performance of these algorithms in terms of accuracy, speed, and resource usage; investigate the impact of different data augmentation techniques; explore the use of inference and fine-tuning to improve accuracy; and assess the potential of these algorithms for real-time road defect detection and classification. By addressing these objectives and challenges, this research could contribute to improving the effectiveness and efficiency of road defect detection and classification using object detection algorithms.
This paper is structured into five main sections. Section 2 provides an overview of the evolution of the YOLO object detection algorithm, focusing on the YOLOv5, YOLOv6, and YOLOv7 variations. Section 3 outlines the methodology used in this study, including data collection and experimental setup. Section 4 presents the results of the experiments conducted and includes a discussion of these results. Finally, section 5 offers concluding remarks and summarizes the study's key findings.

EVOLUTION OF YOLO
YOLO was first introduced in 2015 in the paper "You Only Look Once: Unified, Real-Time Object Detection", whose main purpose was to eliminate the multiple stages of training classifiers on bounding boxes and refining them, by executing only a single stage of object detection while increasing inference speed [16]. Since the release of the first YOLO version, a series of updated YOLO variants has been published by several different scholars, each with its own significant upgrades and features. Following the first version, … [16]. Bochkovskiy et al. [17] continued the series with the release of YOLOv4 in 2020, as well as YOLOv7. These four versions are regarded as the official YOLO versions, while many other YOLO models such as YOLOR, YOLOX, PP-YOLOE, YOLOv5, and YOLOv6 are labelled unofficial as they were published by other researchers. Among those, a few are more popular among end users; for example, YOLOv5, published in 2020 by Ultralytics, and YOLOv6, released by Meituan Inc in 2021, which has comparatively higher performance with its anchor-free method. A few past studies have also analysed the performance of YOLO models. Jiang et al. [18] compared the architectures of YOLOv1 through YOLOv5 and the relationships between them, finding that YOLOv4 and YOLOv5 had similar, and the highest, performance in terms of speed and accuracy at that time. Thuan [19] also concurred with this comparison, while expecting more from YOLOv5 as it was newly released at that time. In this paper, three versions of YOLO, namely YOLOv5, YOLOv6 and YOLOv7, are adopted to compare their performance on road crack and pothole detection and classification.
YOLO was initially developed to use bounding boxes with a corresponding threshold value to precisely detect objects in images using a grid of cells. The YOLOv1 architecture started with the Darknet design of 24 convolutional layers followed by two fully connected layers, inspired by GoogLeNet [16]. To improve the algorithm, YOLOv2 added batch normalization and higher-resolution input and replaced the fully connected layers with anchor boxes, which improved recall by 7% and mAP by 2% [20]. The model was then developed further into YOLOv3, with a more powerful backbone, DarkNet-53, with 53 convolutional layers. It eliminates the softmax classifier, which limits overlapping boxes, and adopts logistic regression instead [21]. Bochkovskiy et al. [17] designed the enhanced YOLOv4 architecture with a new backbone, CSPDarkNet-53, which combines the cross stage partial network (CSPNet) with Darknet and consists of 29 convolutional layers, with the addition of a spatial pyramid pooling (SPP) block, as well as mosaic data augmentation that uses a 4-image mosaic instead of a single image during training.
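The grid-cell idea can be illustrated with a minimal sketch. This is not the original Darknet code: it assumes a YOLOv1-style S x S grid (S = 7 in the original YOLO paper) in which the cell containing an object's centre is responsible for predicting it, and it filters predictions by a confidence threshold. The function names and the sample threshold of 0.25 are hypothetical.

```python
def responsible_cell(x_center, y_center, img_w, img_h, grid_size=7):
    """Return (row, col) of the grid cell that contains the box centre."""
    # Clamp so that a centre lying exactly on the right/bottom edge
    # still maps to the last cell.
    col = min(int(x_center / img_w * grid_size), grid_size - 1)
    row = min(int(y_center / img_h * grid_size), grid_size - 1)
    return row, col


def filter_by_confidence(detections, threshold=0.25):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d["confidence"] >= threshold]


# A box centred in the middle of a 640 x 360 image falls in cell (3, 3).
cell = responsible_cell(320, 180, img_w=640, img_h=360)

# Hypothetical detections, for illustration only.
detections = [
    {"label": "pothole", "confidence": 0.91},
    {"label": "transverse crack", "confidence": 0.12},
]
kept = filter_by_confidence(detections)
print(cell, [d["label"] for d in kept])
```

Later YOLO versions use multi-scale grids and different assignment rules, so this sketch only conveys the basic principle.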

YOLOv5, YOLOv6 and YOLOv7 algorithms
Similar to YOLOv4, YOLOv5 uses CSPDarkNet-53 as its backbone and a path aggregation network (PANet) as the neck to improve the effectiveness of data transfer inside the model, with the addition of a focus layer that replaces YOLOv3's head layers. However, the developer, Ultralytics, has not released any paper on the model. Even though YOLOv5 brings only a few architectural improvements over YOLOv4, it is the first model in the series implemented in PyTorch instead of DarkNet; the PyTorch framework is more user-friendly and uses a language that is widely used in current machine learning. Furthermore, with the focus technique, YOLOv5 models are 90% smaller than YOLOv4, which yields much faster training without impacting the mAP score [22]. Figure 1 presents the overall network architecture of YOLOv5, which consists of three main parts: CSP-Darknet as the backbone, PANet as the neck, and the YOLO layer as the head. CSP-Darknet applies the cross stage partial network strategy, which helps to minimise the excessive duplicate gradient information that arises from using residual blocks. This strategy gives YOLOv5 a faster inference speed because fewer parameters and less computation are used. PANet is a feature pyramid network used in the neck, where it improves pixel localization. The head of YOLOv5 is similar to those of YOLOv3 and YOLOv4: it consists of three convolutional layers that are crucial in calculating the bounding box coordinates.
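The focus technique can be pictured as a space-to-depth slicing step. The sketch below is an illustration in plain Python on a single-channel image stored as nested lists. It is not the Ultralytics implementation, which operates on PyTorch tensors and follows the slicing with a convolution; the function name focus_slice is hypothetical, and even height and width are assumed.

```python
def focus_slice(image):
    """Split an H x W image into four (H/2) x (W/2) sub-images.

    Each sub-image takes every second pixel starting from a different
    offset, so spatial resolution is halved and no pixel is discarded.
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]  # (row offset, column offset)
    return [
        [row[dx::2] for row in image[dy::2]]
        for dy, dx in offsets
    ]


image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
channels = focus_slice(image)
for channel in channels:
    print(channel)
```

Stacking the four sub-images as channels gives a tensor with four times the channel count and half the height and width, which is what lets the following convolution work on a smaller spatial grid.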
In 2021, Meituan Inc published YOLOv6, designed mainly for industrial applications. It is also written in PyTorch, is anchor-free, and has a reparameterized backbone called EfficientRep, in which RepVGG is used for the nano and small models, while CSPStackRep is used for the medium and large models. The neck structure is similar to that of YOLOv5, with bi-directional concatenation (BiC) for more accurate localization, and the head decouples classification from detection. Overall, YOLOv6 delivers better accuracy than the former versions and is 51% faster than previous anchor-based models [23]. Figure 2 shows the overall network architecture of YOLOv6 [23].
YOLOv7 was released with the paper "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors", which introduced a change to the model architecture by integrating the extended efficient layer aggregation network (E-ELAN), which groups computational blocks without changing the transition layers. The architecture is also scaled using a compound scaling method for concatenation-based models in order to adjust inference speed, as seen in Figure 3. Overall, the improved architecture of YOLOv7 increases both detection accuracy and speed [24].
The overall comparison of the development of the YOLO architecture from YOLOv5 to YOLOv7 can be observed in Table 1. Meanwhile, Figure 4 shows the average precision (AP) curves of YOLO models, where YOLOv7 achieved the highest performance in terms of both speed and precision [24].

Data acquisition and pre-processing
The images used in this work were acquired using a GoPro Hero 8 camera mounted at the back of a car, as illustrated in Figure 5. The GoPro Hero 8 offers advantageous features such as image stabilization, light weight, high-resolution output, and practicality. Good image stabilization helps because the camera was mounted on a moving car. The GoPro Hero 8 is also practical to mount on a car since it is small and light, weighing 117 g and measuring 6.2 x 3.2 x 4.5 cm. For data collection, the camera was set to video mode at a resolution of 1920 x 1080 pixels and 24 fps. A linear digital lens was chosen to minimise the barrel effect. The camera was set at a height of 160 cm so that it could capture the road surface at a width of 3.1 m, considered the largest typical width of a road.

Figure 5. Camera setup on vehicle for data acquisition
Videos of the road were captured in mp4 format, each lasting 5 to 10 minutes, at a maximum driving speed of around 30 km/h. Images were extracted from the videos and saved in jpg format at a resolution of 1920 x 1080 pixels. A total of 8396 images were extracted from all the videos acquired during data collection, and after manually filtering out images without any visible road defects, 3328 images remained.
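As a rough illustration of what these capture settings imply, the sketch below works out the distance the car travels between consecutive video frames at the stated maximum speed and frame rate, and the share of extracted images kept after filtering. It uses only the figures quoted above; treating the car as moving at a constant 30 km/h is a simplifying assumption, and the variable names are illustrative.

```python
SPEED_KMH = 30.0           # maximum driving speed during capture
FRAME_RATE_FPS = 24.0      # video frame rate
EXTRACTED_IMAGES = 8396    # images extracted from all videos
KEPT_IMAGES = 3328         # images remaining after manual filtering

# Convert km/h to m/s, then divide by frames per second.
speed_m_per_s = SPEED_KMH * 1000.0 / 3600.0
metres_per_frame = speed_m_per_s / FRAME_RATE_FPS

kept_fraction = KEPT_IMAGES / EXTRACTED_IMAGES

print(f"Distance between frames: {metres_per_frame:.3f} m")
print(f"Images kept after filtering: {kept_fraction:.1%}")
```

At the maximum speed, consecutive frames are therefore at most about 0.35 m apart, and roughly 40% of the extracted images contained visible defects.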
Roboflow was chosen as the primary tool to annotate, split, and augment the images. The images were annotated manually using bounding boxes. The annotated defects were divided into four classes: crocodile cracks, longitudinal cracks, transverse cracks, and potholes. The image dataset was then split into training, validation and test sets at a ratio of 7:2:1. The images were then augmented by flipping them about both the vertical and horizontal axes, resulting in a total dataset of 4788 images, split into 4000 training images, 533 validation images and 255 test images, with a final image resolution of 640 x 360 pixels.
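The split and flip steps described above were performed in Roboflow; the following is only a minimal sketch of the same ideas in plain Python. It assumes labels in the normalised YOLO text format (class id, x centre, y centre, width, height, all relative to image size), and the function names, seed, and sample box are hypothetical. The counts it produces will not match the reported 4000/533/255 split, which depends on Roboflow's own procedure.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test by the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def flip_box(box, horizontal=False, vertical=False):
    """Flip a normalised (class_id, x_center, y_center, width, height) box."""
    class_id, x_c, y_c, w, h = box
    if horizontal:  # mirror left-right: only the x centre changes
        x_c = 1.0 - x_c
    if vertical:    # mirror top-bottom: only the y centre changes
        y_c = 1.0 - y_c
    return (class_id, x_c, y_c, w, h)


# Split the indices of the 3328 filtered images at 7:2:1.
train, val, test = split_dataset(range(3328))
print(len(train), len(val), len(test))

# Flip one hypothetical label left-right.
flipped = flip_box((3, 0.25, 0.40, 0.10, 0.20), horizontal=True)
print(flipped)
```

The key point of the label update is that a flip changes only the box centre along the flipped direction; the box width and height are unchanged.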
Sample images for each of the four defect classes are shown in Figure 6: crocodile crack in Figure 6(a), longitudinal crack in Figure 6(b), pothole in Figure 6(c), and transverse crack in Figure 6(d).

Deployment of YOLO models for road crack detection
YOLO models were chosen because of their proven fast inference speeds and high accuracies. In this work, the performance of YOLOv5, YOLOv6 and YOLOv7 models in road crack detection was evaluated and compared. The YOLOv5, YOLOv6 and YOLOv7 models were obtained from github.com/ultralytics/yolov5, github.com/meituan/yolov6 and github.com/WongKinYiu/yolov7, respectively. They were trained using the prepared dataset described in the previous section. Google Colab, which offers high-performance GPUs, was used for training the models. Roboflow was used to annotate the images, augment selected images, and create the configuration files for model training. Each model was trained for 100 epochs. Finally, inference was also done in Google Colab, although it could have been done locally on a typical laptop as it does not require high processing power. To find the best-performing model in terms of both speed and accuracy, twelve YOLO model variants were investigated: YOLOv5-n (nano), YOLOv5-s (small), YOLOv5-m (medium), YOLOv5-l (large), YOLOv5-x (extra-large), YOLOv6-n, YOLOv6-s, YOLOv6-m, YOLOv6-l, YOLOv7-tiny, YOLOv7 and YOLOv7-x.
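The configuration files used in this work were generated by Roboflow and are not reproduced here. As a hedged illustration, the sketch below builds a dataset description in the layout commonly used by YOLOv5-style data files (keys train, val, test, nc, names) and enumerates the twelve variants listed above. The directory path, the exact class-name strings, and the helper name build_data_yaml are assumptions made for the example.

```python
# Class names are assumed spellings of the four classes in this study.
CLASS_NAMES = [
    "crocodile crack",
    "longitudinal crack",
    "pothole",
    "transverse crack",
]

# The twelve model variants compared, grouped by YOLO version.
MODEL_VARIANTS = {
    "YOLOv5": ["n", "s", "m", "l", "x"],
    "YOLOv6": ["n", "s", "m", "l"],
    "YOLOv7": ["tiny", "base", "x"],
}
EPOCHS = 100  # every model was trained for 100 epochs


def build_data_yaml(root, class_names):
    """Return the text of a YOLOv5-style dataset description file."""
    quoted = ", ".join(f"'{name}'" for name in class_names)
    lines = [
        f"train: {root}/train/images",
        f"val: {root}/valid/images",
        f"test: {root}/test/images",
        f"nc: {len(class_names)}",
        f"names: [{quoted}]",
    ]
    return "\n".join(lines) + "\n"


# "/content/road-defects" is a hypothetical Colab path.
data_yaml = build_data_yaml("/content/road-defects", CLASS_NAMES)
total_runs = sum(len(sizes) for sizes in MODEL_VARIANTS.values())

print(data_yaml)
print(f"{total_runs} training runs of {EPOCHS} epochs each")
```

Keeping the dataset description in one file means every variant is trained against exactly the same split, which is what makes the later comparison meaningful.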
The results obtained from each run were evaluated in terms of precision and accuracy. At the end of each training run, the results were saved; they include precision, recall, mAP@0.5, and mAP over IoU thresholds ranging from 0.5 to 0.95. The main parameters to focus on are accuracy and mAP@0.5, the mean average precision at an IoU threshold of 0.5. As accuracy is not included in the saved results, it must be calculated from each training run's confusion matrix. The calculations for each of the results are given in (1)-(5), where TP is true positive, TN is true negative, FP is false positive, FN is false negative, and AP is average precision.
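The sketch below uses the standard textbook definitions of these quantities in terms of TP, TN, FP, FN and AP; that they coincide with equations (1)-(5) of the paper is an assumption. The counts and per-class AP values are made-up illustrative numbers, not results from this study, and the IoU helper is included only to show what the 0.5 to 0.95 thresholds are applied to.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Fraction of all decisions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 80, 90, 20, 10
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"accuracy  = {accuracy(tp, tn, fp, fn):.3f}")

# Hypothetical per-class AP values for four classes.
print(f"mAP       = {mean_average_precision([0.9, 0.7, 0.8, 0.6]):.3f}")

# IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
print(iou_thresholds)
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class is at least that threshold, so mAP@0.5 is more lenient about localization than the average over 0.5 to 0.95.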

RESULTS AND DISCUSSION
Table 2 shows the performance results of all the models trained in this work. Since all models were trained using the same instances and dataset for each run, the results can be analysed comparatively. It can be seen that YOLOv7-tiny has the shortest training time despite being in a bigger size class than YOLOv5-s and YOLOv6-s. Comparing each model relative to its respective size, YOLOv7 still has an … The YOLOv7 model also records the highest accuracy, at 87.16%. Among the YOLOv5 models, YOLOv5-l performs best with a mAP of 78.9% and an accuracy of 85.65%, while among the YOLOv6 models, YOLOv6-l leads with a mAP of 72.32% and a higher accuracy of 86.9%. Although YOLOv5-l has a higher mAP (78.9%) than YOLOv5-x (78.3%), as can be seen in Figure 7(a), it can also be observed that YOLOv5-x performs best compared to the other models over the run, as it maintains the highest curve throughout. Meanwhile, in Figure 7(b), YOLOv6-l exhibits the best performance of the four YOLOv6 models. Lastly, YOLOv7 and YOLOv7-x improve at a similar rate throughout the run, with YOLOv7 outperforming YOLOv7-x in the last few epochs, as shown in Figure 7(c).
To evaluate the results further, the best model of each version (YOLOv5-l, YOLOv6-l and YOLOv7) was tested by running inference on the 255 test images. The inference times for the best models are recorded in Table 3, with YOLOv7 shown to be the fastest. Four sample result images from each best model were compared based on the detection of the crack classes. The confidence score displayed on each bounding box was used to analyse the models' inference performance, alongside the correctness of the labels assigned to the identified cracks. Figures 9 to 12 display sample inferred images for the different types of cracks detected. Figures 9(a) to (c) compare the confidence scores of YOLOv5-l, YOLOv6-l and YOLOv7 in detecting an obvious crocodile crack, where all models give the same high score of 0.98. Figures 10(a) to (c) concern the accuracy of detecting multiple cracks in one image; they show that YOLOv5-l detects the second longitudinal crack that the other two models miss, and gives comparatively higher scores for the longitudinal crack and pothole detected. Meanwhile, Figures 11(a) to (c) compare images with a combination of crocodile and transverse cracks, showing that the best results come from YOLOv5-l and YOLOv7, which have similar confidence scores, with YOLOv5-l scoring 0.07 higher in detecting the transverse crack. Despite having lower confidence scores than the other models overall, YOLOv6-l unexpectedly detected the transverse crack shown in Figure 12(b), whereas the other two models did not detect the obscure crack at all, as seen in Figures 12(a) and (c). From this comparison, it can be concluded that even though YOLOv5-l and YOLOv7 perform very similarly in inference, YOLOv5-l has the upper hand in confidence score.

CONCLUSION
This paper evaluated the performance of three YOLO versions, YOLOv5, YOLOv6 and YOLOv7, in detecting and classifying road defects. It was observed that YOLOv5-l and YOLOv7 performed best among the 12 models assessed, with very similar performance. In terms of training on a dataset of 4000 images, YOLOv5 had a training time of 4.92 h, while YOLOv7 trained for 5.7 h, and they reached mAP@0.5 scores of 78.9% and 79.0%, respectively. This shows that YOLOv5 has the upper hand in training performance, as both reached a similar precision. For inference on the 255 test images, YOLOv5 took 0.97 minute while YOLOv7 took 0.47 minute; the models were also compared by confidence score, where YOLOv5 scored higher. Thus, even though YOLOv7 can perform inference about twice as fast as YOLOv5, YOLOv5 still has the advantage in the accuracy and precision of the detected cracks. Nonetheless, due to resource limitations, such as restricting training to only 100 epochs, using only 640 x 360 resolution images, and working with fewer than 5000 images in total, the results are confined to a single experimental setting. To improve upon these findings, future research could involve working with additional YOLO models and using higher-resolution images together with varying numbers of training epochs. Furthermore, additional pre-processing steps could be applied to the dataset, and differences in inference on images with varying lighting could also be explored.
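The timing comparison can be made concrete with a short calculation. The sketch below uses only the totals quoted in this section (0.97 and 0.47 minutes for 255 test images, and 4.92 h and 5.7 h of training); the per-image times and ratios are derived here for illustration and are not separately reported measurements.

```python
TEST_IMAGES = 255
inference_minutes = {"YOLOv5": 0.97, "YOLOv7": 0.47}
training_hours = {"YOLOv5": 4.92, "YOLOv7": 5.7}


def per_image_seconds(total_minutes, n_images):
    """Average inference time per image, in seconds."""
    return total_minutes * 60.0 / n_images


seconds_per_image = {
    name: per_image_seconds(minutes, TEST_IMAGES)
    for name, minutes in inference_minutes.items()
}

# How many times faster YOLOv7 is at inference, and how much longer it trains.
inference_speedup = inference_minutes["YOLOv5"] / inference_minutes["YOLOv7"]
training_ratio = training_hours["YOLOv7"] / training_hours["YOLOv5"]

for name, seconds in seconds_per_image.items():
    print(f"{name}: {seconds:.3f} s/image, {1.0 / seconds:.1f} images/s")
print(f"YOLOv7 inference speed-up over YOLOv5: {inference_speedup:.2f}x")
print(f"YOLOv7 training time relative to YOLOv5: {training_ratio:.2f}x")
```

On these figures, YOLOv7 processes an image in roughly 0.11 s against roughly 0.23 s for YOLOv5, while taking about 16% longer to train.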

ACKNOWLEDGEMENT
The authors would like to thank the Malaysian Ministry of Higher Education (MOHE) for financing the research project through the FRGS grant FRGS/1/2021/TK02/UIAM/02/4. We would also like to express

The following sections present the methodology of implementing the three selected YOLO models, YOLOv5, YOLOv6 and YOLOv7, in detecting and classifying road defects.

Figure 4. Comparison of YOLO models performance based on AP curve

Figure 6. Image samples of road defects captured for each class; (a) crocodile crack, (b) longitudinal crack, (c) pothole, and (d) transverse crack

Table 2. Training performance results for YOLOv5, YOLOv6 and YOLOv7 models

Table 3. Testing speed for inferencing 255 test images using YOLOv5, YOLOv6, and YOLOv7 best model