Stacking ensemble learning for optical music recognition

ABSTRACT


INTRODUCTION
Music is often described as structured notes in time and musical notation is a music representation that visually communicates that definition of music [1].Music is an art of human culture that is passed down from generation to generation.Generational changes also make changes and developments in the music culture itself.The development of this musical culture eventually brought it to the stage it is today, where musical notation can be described using a very common notation, namely the common western music notation (CWMN).This music notation has become an international representation and at the same time the most common in representing music in writing.This musical notation eventually became a problem in the field of computer vision which had the basic idea of making computers able to recognize musical symbols in this musical notation, just like humans.This idea finally led researchers to various problems that must be solved, so that the computer can recognize and detect musical symbols well.This problem is known as optical music recognition (OMR).OMR is a field of study that studies and develops computer algorithms and models to recognize musical notation in a document [2], with CWMN as the most common musical notation used in the study.CWMN is composed of musical symbols representing music.Figure 1 shows several symbols used in  OMR has several benefits in everyday life.Music notation can be converted into several forms, such as musical instrument digital interface (MIDI) for playing the music and MusicXML for storing music notation in the form of sheet music documents.The benefits contributed by this OMR research greatly support musicians in practicing, exploring musical plays, and writing songs in CWMN musical notation.All the benefits provided by this OMR study are highly dependent on the model's performance in detecting and recognizing symbols in a musical score, which is determined by the method and the results of the classification of musical symbols.Therefore, it is imperative to determine the appropriate and accurate classification method to be used in OMR experiments.Determining a good classification method for OMR eventually becomes a problem in the field of study.
Researchers have conducted many studies and proposed various models capable of detecting and recognizing CWMN symbols.Mejía et al. [3] experimented with classifying music sheet images using several baseline architecture convolutional neural networks, namely VGG16, MobileNet, ResNet50, Inception V3, and Inception-ResNet-V2.The experiment shows good performances of the five models, with MobileNet achieving the highest accuracy.Other studies using datasets containing images of music scores or staves pieces have been carried out using several models or other techniques, such as a deep convolutional neural network (DCNN) using the darknet53 basic network on YOLO [4], deep watershed detector (DWD) [5], parallel bat algorithm [6], U-Net [7], [8], and several variations of the model using convolutional recurrent neural network (CRNN) [9]- [13].These models have shown good results in carrying out the OMR task.In addition to these studies, some studies perform classification tasks at the symbol level of CWMN.These studies are conducted using datasets that contain images of cropped musical notation symbols, where each image will only contain one symbol.Some studies used several variations of the feature extractor followed by the K-nearest neighbor (KNN) classifier [14], [15].There is also a study that applies a texture-based feature descriptor (daisy descriptor) that is optimized using quantum concept inspired gray wolf optimization and is continued by comparing the performance of several classifiers, namely multi-layer perceptron (MLP), KNN, Naive Bayes (NB), random forest (RF) and sequential minimal optimization (SMO) [16].Although these studies have produced good and very good results in some of the studies, these results can still be improved.
In performing classification tasks, the ensemble learning method has been widely used in other areas of classification problems and has produced excellent performance results [17]- [23], but as far as is known at the time of writing this study, there is only one study that applied this method to OMR [24].Ensemble classification is a learning method with a mining approach that utilizes various classifiers that distinguish class labels for unlabeled things from accumulation [25].The main idea of creating learning ensembles is to improve prediction performance by constructing multiple models or multiple predictions [26].Paul et al. [24], proposed an OMR ensemble learning using three pre-trained deep learning models, namely ResNet50, DenseNet161, and GoogLeNet as the base-classifier models.The base-classifier segment of the model is then followed by a support vector machine (SVM) meta-classifier.However, in this study, several DCNN models which are newer and superior to the three models were selected to be designed as ensemble models.Those models are ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S.
As the improved model of ResNet, ResNeXt can outperform ResNet in experiments [27], [28] and can perform classification well [29], [30].It has only a few hyperparameters to set and its cardinality can Bulletin of Electr Eng & Inf ISSN: 2302-9285  improve classification accuracy.Inception-V3 was introduced in [31] and proven to be able to do classification tasks very well [32]- [34].RegNet [35] and EfficientNet-V2 [36] can be said to be still fairly new models since they were first introduced in 2020 and 2021 respectively.Both models have been used several times in the study of classification and have also shown that both models can perform well on the given task [37]- [42].Therefore, based on those studies, these four models are chosen in this research to be contained as the ensemble base-classifier models for the OMR ensemble model.In addition, because there are no other studies that carry out ensembles on the OMR task, it is necessary to experiment and analyze the meta-classifier method that can perform well in the OMR ensemble model, which has not been done so far.Several machine learning techniques are chosen and analysed in this study, namely, SVM, logistic regression (LR), RF, KNN, decision tree (DT), and NB.There are also six musical notation symbols datasets used in this study, namely the handwritten online music symbols (HOMUS) version 2 [43], Rebelo1 [44], Rebelo2 [44], Fornes [45], OpenOMR [46], and PrintedMusicSymbols [47] datasets.These datasets are a collection of OMR datasets which do not contain music score images, but cropped CWMN symbols, so that these symbols are no longer placed on staff lines.These datasets are then combined, hence producing a unified dataset.Using the proposed model and the determined datasets, this study will only focus on boosting the performance result produced by the model without considering other assessment variables, such as time and resources required.All experiments built were run using the Python programming language on Google Colab Pro.The Python programming language version used is Python 3.8.15.The experiment was run using the available GPU on Google Colab Pro, which is one among the NVDIA Tesla K80, P100, and T4 randomly selected by the server.The results of the performance of each DCNN and the results of the ensemble of each meta-classifier in each dataset have been presented in this study.

METHOD 2.1. Overview
In this study, an ensemble learning design has been proposed to perform the OMR task.The designed learning ensemble consists of four DCNN architectures in the base-classifier segment and one machine learning classifier in the meta-classifier segment.Several DCNN architectures that are used as the base-classifier models are ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S.These models in the base-classifier segment will carry out the classification task independently.This classification process produces a collection of predictions produced by the four models of the base-classifier segment.This collection of predictions are given to the meta-classifier as the input data.The meta-classifier will once again perform the classification process and produce final prediction for each data in the dataset.Hence, ensemble prediction is produced.In this meta-classifier segment, a machine learning technique will be used as the meta-classifier of the proposed ensemble learning method.To determine the best machine learning technique to be applied to the ensemble learning model for the OMR task, several machine learning techniques, namely SVM, LR, RF, KNN, DT, and NB, are chosen to be analysed in this study.The designed ensemble learning model is tested against a unified OMR dataset which contains six OMR datasets that are publicly available, namely HOMUS_V2, Rebelo1, Rebelo2, Fornes, OpenOMR, and PrintedMusicSymbols.These datasets are combined, forming a unified dataset.As a result, in this study, six experiments have been carried out with each machine learning technique as the meta-classifier of the ensemble model.Figure 2 shows the workflow overview of the conducted experiments in this study.

Dataset
The dataset used in this study is a dataset that contains images of CWN music notation symbols.Both printed and handwritten symbols are used.Pacha and Eidenberger [48] managed to build a tool that has made a huge contribution to the OMR field.The tool is a python package called the omrdatasettools.This tool can retrieve several datasets created by several researchers, namely HOMUS_V2, Rebelo1, Rebelo2, Fornes, OpenOMR, and PrintedMusicSymbols datasets.Each dataset will be extracted to produce six folders representing each of the six datasets.Each of these folders contains various CWMN symbol images.HOMUS_V2 and Fornes are datasets containing handwritten musical notation symbols.Meanwhile, Rebelo1, Rebelo2, OpenOMR, and PrintedMusicSymbols are datasets containing printed music notation symbols.Table 1 shows the overview of each dataset mentioned.These datasets are then combined, forming a unified dataset.In the combining process, the dataset is analysed so that there is no overlap in the classes.Data that has the same class, but different class writings have been changed to be labeled as one proper class label and merged into the same class folder, resulting in a dataset containing 64 classes and 35,460 images of musical symbols.Due to the limited resources used in this study, this dataset was down sampled by determining that each class only accommodates a maximum of 300 images to ease the training process.This dataset will hereinafter be referred to as the Downsampled300Unified dataset.The Downsampled300Unified dataset contains 64 classes and 12,401 images of CWMN symbols (34.97% of the whole unified dataset).After the dataset is created, some classes have too little data.There even exists two classes that only have one sample.This can cause problems when splitting data into training, validation, and test set.Therefore, these classes, which count as many as 14 classes, were removed from the dataset.That way, the Downsampled300Unified dataset contains 50 classes and 12,256 CWMN symbol images (34.56% of the entire unified dataset) with the smallest number of samples being 73. Figure 3 shows several sample images of the Downsampled300Unified dataset.  2 shows the amount of data in each class.The dataset has an unbalanced amount of image data.This is left so to evaluate the model's performance in dealing with unbalanced datasets.Thus, no data augmentation process is carried out in this study.For the preparation of the experiment, the dataset used will be given a little pre-processing, namely changing the size of the image.Each musical notation image in the dataset will be resized to 299×299 pixels.This image size was chosen because it is the minimum size of the input image that can be accepted by Inception-V3.Other DCNN models are also capable of accepting input images of this size.The input image is not resized for each model to give the same treatment to all models so that the stacking ensemble concept can be fully applied.No further pre-processing method is required to enhance the images on the dataset.
The dataset is then split into train, valid, and test data with the portion of 60%, 20%, and 20% respectively.The amount of train data is set at 60% to prevent the training process from being too long since four DCNNs will go through the classification task.All these split data are then given to each DCNN in the base-classifier segment.

Ensemble learning
There are three commonly used ensemble methodologies, namely bagging, stacking, and boosting [25].In this study, stacking ensemble learning has been chosen to be applied to the OMR task.Stacking ensemble learning has been determined to be used because, in this type of ensemble learning, each base-classifier model will be given the whole data in the dataset, in contrast to bagging ensemble learning where each base-classifier model will only get a part or subset of the dataset used.This makes the result  ISSN: 2302-9285 Bulletin of Electr Eng & Inf, Vol. 12, No. 5, October 2023: 3095-3104 3100 obtained in [26] show that stacking ensemble learning can outperform other types of ensemble learnings in the problems studied.
The ensemble learning in this study is built using several DCNN models as the base classifier, namely ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S.Each model will produce its prediction for all musical symbol images in the given dataset.The prediction results of these models are very likely to have differences.Therefore, in the ensemble learning model, the class prediction results generated by all base classifiers will be used as knowledge and input for a meta-classifier.This meta-classifier will provide the final result of the classification process for all data in the given dataset.
The meta-classifier in this learning ensemble can be performed using various machine learning techniques.Therefore, further analysis is needed to find the best technique to be applied to OMR ensemble learning.Several machine learning techniques that excel in classification tasks have been determined to be used as meta-classifiers in ensemble learning, namely SVM, LR, RF, KNN, DT, and NB.Results of the ensemble learning classification performance of each meta-classifier in each dataset are then reported to determine which of the machine learning techniques produced the best result in the OMR task.

Evaluation
In this study, the data is split into training, validation, and testing data with percentages of 60%, 20%, and 20% respectively.The split dataset is then given to the ensemble model.The learning ensemble that is built in this study will be given several evaluations.In each classification process carried out by all base-classifier models, accuracy calculations will be carried out.This is done to provide a report on the results of the classification performance carried out by each base-classifier model on the data provided without any ensemble learning.
After the prediction results by the base-classifiers are given to the meta-classifier, the meta-classifier will produce the final prediction results for each data in the dataset.These predictions can be mapped in a multi-class confusion matrix that contains the true positives, true negatives, false positives, and false negatives values.Using these values, the accuracy and F1-score evaluation metric can be calculated.Accuracy score can be calculated as in (1).F1-score will be calculated for each class in the dataset as in (2).After F1-score of each class are calculated, then the average value of the f1-score which concludes the value of the f1 score for the entire ensemble model can be calculated as in (3).Since in this study six meta-classifier methods are analysed, therefore 12 evaluation metric scores are produced in the experiment.Using these evaluation metric scores, the best meta-classifier can be concluded.Meta-classifier that yields the highest evaluation metric scores is considered the best meta-classifier.
The built ensemble model is also evaluated by comparing the model with a comparative model.This is done by comparing the accuracy scores produced by the six meta-classifiers of both ensemble models.Therefore, the comparative ensemble model is also given several experiments using the determined six machine learning techniques as the meta-classifier of the ensemble model.In this way, the improvements provided by the built model ensemble can be clearly reported and assessed.

RESULTS AND DISCUSSION
This study has experimented with performing the proposed method of ensemble learning on an OMR task using the determined Downsampled300Unified dataset.The designed ensemble model consists of four DCNNs as the base classifiers, namely ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S, and one machine learning technique as the meta-classifier of the ensemble model.To determine the best machine learning technique to be used as the OMR ensemble model's meta-classifier, several machine learning techniques are determined to be performed in the ensemble model, namely SVM, LR, RF, KNN, DT, and NB.
The experiment is done to produce several results.The first result is the performance of the determined four base classifier models (the four DCNNs) on the dataset.The second result is the performance results of Bulletin of Electr Eng & Inf ISSN: 2302-9285  the ensemble learning model on the dataset using each meta-classifier.The last result is the best meta-classifier produced in this research through the performed experiment.The performance of each base classifier model is evaluated using the accuracy score.In the base-classifier segment of the ensemble model, each DCNN base classifier performed the classification task on the given dataset, the Downsampled300Unified dataset.This classification task can be evaluated using the accuracy score, thus the performance of each DCNN can be evaluated.Table 3 shows each DCNN's accuracy scores in classifying CWMN symbols in the Downsampled300Unified dataset.The experiment showed that EfficientNet-V2-S outperformed the other DCNN in terms of accuracy score.The model produced a 0.9735 accuracy score.The best accuracy score ranking was followed by Inception-V3 with an accuracy score of 0.9694, then ResNeXt50 with an accuracy score of 0.9670, and RegNetY-400MF with an accuracy score of 0.9625 respectively.After the four DCNNs' performances are evaluated, the ensemble model built will then be evaluated.This is done by performing six experiments since in this research, six machine learning techniques are determined to be used as the meta-classifier of the proposed ensemble model.The classification predictions of the meta-classifiers are evaluated using two evaluation metrics, namely the accuracy score and the F1-score.Table 4 shows the accuracy and F1-scores of each meta-classifier in performing the classification task using the Downsampled300Unified dataset.Through the carried-out experiment, it has been obtained that the proposed ensemble learning model has produced outstanding evaluation metric scores.Almost all the evaluation metric scores achieved 0.97 scores with the best value of 0.9751 in terms of accuracy score and 0.9752 in terms of F1-score, both of which were achieved by the LR meta-classifier and also KNN in the accuracy score.This indicates that the proposed ensemble learning model succeeded in achieving a good performance in completing the OMR task.Using this evaluation metric results, the best meta-classifier can be determined.The conducted experiment has shown that LR produced the highest accuracy and F1-score with slight differences from the other meta-classifiers accuracy and F1-scores.Thus, LR succeeded in achieving the best meta-classifier perfectly.
The proposed ensemble learning model is then compared with the ensemble learning that has been proposed by [24].Therefore, an ensemble model that uses ResNet50, DenseNet161, and GoogLeNet as the base classifiers is also built.The model is also performed in six experiments using each determined machine learning technique determined in this study as the meta-classifier.The experiment is also done using the same dataset, therefore there is no difference in the data volume between the dataset used using both ensemble models.This model is used as the comparative model.The comparison process is done by comparing both models' accuracy scores using each meta-classifier.Table 5 shows the comparison result between the proposed model and the comparative study.
Through this comparison process, it can be seen proved that the proposed model outperformed the comparative model in every meta classifier's ensemble accuracy.The biggest accuracy score margin occurred in the DT meta-classifier, with a margin of 0.0163.the second and the third biggest margin occurred in the RF and NB meta-classifier with the margin of 0.0082 and 0.0077 respectively.The fourth and fifth biggest margin occurred in the LR and KNN meta-classifier with the margin of 0.0053 and 0.0049 respectively.The smallest margin occurred in the SVM meta-classifier with the margin of 0.0029.

CONCLUSION
This study aims to build model ensembles with newer and more robust base classifier models and to analyze several machine learning techniques to be used as the ensemble learning model's meta-classifier for the OMR task.The study only focused on boosting the performance result produced by the model without considering other assessment variables, such as time and resources required.A stacking ensemble learning model has been designed to use four DCNNs, namely ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S, followed by a machine learning meta-classifier.Six machine learning techniques are determined to be analysed, namely SVM, LR, RF, KNN, DT, and NB.The model is tested against the Downsampled300Unified dataset.
The proposed model has succeeded in increasing the ensemble learning performance score compared to the previous study.It is proven that several models that are proven good to be used on the classification task can improve the performance of an OMR ensemble learning model, such as ResNeXt50, Inception-V3, RegNetY-400MF, and EfficientNet-V2-S.As a result, the proposed ensemble learning model succeeded in achieving outstanding evaluation metric scores in completing the OMR task.
Through the conducted experiment performed using only the base-classifier models, it is shown that EfficientNet-V2-S outperformed the other models with a slight difference in the accuracy score.The model succeeded in obtaining an accuracy score of 0.9735.The experiment on the ensemble model has also shown an outstanding result.Almost all the evaluation metric scores achieved 0.97 with the best values of 0.9751 and 0.9752 in terms of accuracy and F1-score, respectively; both of which were achieved by the LR meta-classifier.
This study has succeeded in building an ensemble learning for the OMR task and has produced very good results.The study has also analyzed machine learning techniques that are good for use as a meta-classifier of OMR ensemble learning.This study opens business opportunities to create a music notation recognition software or application that is capable of accurately recognizing musical symbols.So that musicians will be supported in doing their work.Academically, further OMR research that is built using ensemble learning can use this study as a reference.In addition, because the OMR task is similar to the OCR task, the ensemble model proposed in this study can also be built for OCR research.
Although the proposed model has produced good results, there are still some things that can be improved in further research.In this study, the four DCNNs used are the lowest architectural structures of their kind.This is done to ease the process of running experiments on the ensemble model.In the next study, this problem can be experimented with the same DCNN model but using a more complex architectural arrangement, such as using the large or medium version of EfficientNetV2.In addition, other DCNN models that are proven to be faster and do not reduce model performance can also be used.This will make the research not only focused on boosting the evaluation metric results but can also consider the time costs incurred by the model.

Figure 3 .
Figure 3. Sample images of the Downsampled300Unified dataset

Figure 4 .
Figure 4. Distribution of the number of images in each class in the Downsampled300Unified dataset

Table 2 .
The amount of data in each class

Table 3 .
Base classifiers accuracy scores on the Downsampled300Unified dataset

Table 4 .
Proposed ensemble model's evaluation metric scores using each meta-classifier

Table 5 .
Comparison between the proposed model and the comparative study