An Adam based CNN and LSTM approach for sign language recognition in real time for deaf people

ABSTRACT


INTRODUCTION
In recent years, artificial intelligence (AI) has made it easier for people to process information, make decisions, and carry out tasks. It can support medical diagnosis and treatment [1] as well as human development through education and training initiatives [2]. It can also aid disabled people in various ways. For example, an IoT-based deep learning approach can provide indoor thermal comfort for disabled people [3], and assistive technologies such as text-to-speech, speech recognition, and natural language processing can benefit people who struggle with communication. People with limited mobility can get help with physical tasks from AI-powered robots and drones [4]. AI can also generate personalized rehabilitation programs for people with physical and neurological disabilities. In addition, chatbots, virtual assistants, and other AI-powered interfaces can make it easier for people with cognitive or developmental disabilities to get information and complete tasks.
Deaf and hard-of-hearing people communicate with others and with their community primarily through hand and body gestures in sign language (SL). It differs in vocabulary, meaning, and grammar from spoken and written language. Spoken language conveys messages through distinct sounds linked to specific words and grammatical structures [5], whereas sign language conveys them through visual hand and body motions. Deaf people experience stress because communication barriers prevent them from using all services effectively [6]. Worldwide, 34 million children and 432 million adults require rehabilitation for "disabling" hearing loss. Disabling hearing loss is predicted to affect more than 700 million people by 2050, or 1 in 10 people. Large segments of society value sign language as an alternative means of communication. According to the International Federation of the Deaf, there are 70 million deaf people in the world who use more than 300 sign languages [7], [8].
Each sign in a given sign language differs in hand shape, motion profile, and placement of the hand, face, and other body parts. Because of this, visual sign language recognition is considered a complex area of study in computer vision [8]. It is also a multidisciplinary problem that still needs to be studied in depth before what people are signing can be fully understood [9]. Several new AI technologies have been used to understand sign language in the past few years. According to recent studies, this can be done in a few different ways [7]-[9]. One way is to use computer vision and machine learning algorithms to analyze videos of people using sign language. These algorithms can be trained to recognize particular motions and signs and to translate sign language into text or voice. Another method is to employ wearable technology, such as gloves or other accessories with sensors, to track the movements of the hands and fingers. The sensor readings can then be used by machine learning algorithms to recognize particular signs and signals. Due to several factors, sign language recognition is about 30 years behind voice recognition systems. One of the leading causes is that two-dimensional video signals require far more processing and recognition effort than one-dimensional audio signals. Also, the vocabulary and meanings of sign language have yet to be fully documented, and standard dictionaries are lacking. In addition, many signs have no universal definition. Games, virtual reality settings, robot control, and natural language conversations are just a few areas where these technologies can be used successfully [9], [10].
The sign language dataset is captured from a webcam, specifically through the DroidCam client-server setup so that the method can be integrated with a phone [11]. Two datasets are prepared for this study, DS-1 and DS-2. DS-1 contains three classes, and DS-2 contains twenty-six classes with a total of 28,600 images. DS-1 is used to fit both long short-term memory (LSTM) and gated recurrent unit (GRU) models for classifying three classes: "hello," "iamhungry," and "thanks." On average, LSTM scores higher than GRU on all performance measures (accuracy, precision, recall, and F1-score). DS-2 is used to train the convolutional neural network (CNN) model. The CNN model trained on the proposed dataset achieves 89.07% accuracy, while the LSTM reaches 94.3% accuracy.

LITERATURE REVIEW
Several previously published works were reviewed and analyzed. Huang et al. [10] discuss how the Kinect and a CNN can be used to recognize 3D sign language. The authors used a 3D CNN to extract spatial and temporal information from raw data and to learn features that adapt to the large variations in hand movements. A realistic dataset of twenty-five signs is used in their study. Pigou et al. [12] built a CNN-based recognition system using Microsoft Kinect. Thresholding, background removal, and median filtering were employed in the system's preprocessing. They used the Nesterov accelerated gradient (NAG) optimizer, and the system was very good at recognizing Italian gestures. Wang et al. [13] used multidimensional hidden Markov models (HMMs) to identify signs using a sensory glove and the bird's motion tracker. According to Priya et al. [14], support vector machine (SVM), k-nearest neighbor (KNN), logistic regression, and CNN are among the techniques that can be used to make it much simpler for a non-signer to communicate with a signer. They cover the American sign language (ASL) signs for 26 letters and ten numbers. The experimental accuracy, 80.30% for SVM and 93.81% for deep neural networks (DNN), was encouraging. Srininvas et al. [15] proposed using SVM and ANN to recognize sign language across 26 classes for deaf and mute users.
Hein et al. [16] implemented their approach in two segments: a training segment and a classification segment. First, a webcam input video was taken for the training part. Then, the components were found using a component detection approach. After finding the components, they used their proposed localization method to locate the head and hand. Elmezain et al. [17] also built a real-time system based on the hidden Markov model (HMM) that can automatically recognize Arabic digits (0-9) in both isolated and continuous gestures. HMM topologies with different states, such as ergodic, left-right (LR), and left-right banded (LRB), can also handle single gestures. Chakraborty et al. [18] classified the letters of the English alphabet based on the hand gestures that represent them in Indian sign language (ISL). They did this using Google's MediaPipe API. With this API, the x, y, and z coordinates in three-dimensional space can be obtained for each of the 21 landmarks on each hand. They found that with the MediaPipe API they could accurately predict ASL and a few other sign languages. SVM was compared with random forest (RF), KNN, and decision tree (DT) in terms of accuracy, and SVM was the most accurate, with a 99% accuracy rate. Shankar et al. [19] state that object identification is the most common application of computer vision. It is a technique for locating and identifying objects like furniture and artwork. Although numerous detection methods exist, their precision and efficiency still need improvement. Phi et al. [20] developed a glove-based gesture recognition system using ten flex sensors and an accelerometer.
Fang et al. [21] used data gloves and 3D position trackers to build a Chinese sign language recognition system that could understand 91.9% of the words in 1,500 test phrases. McGuire et al. [22] used a one-handed glove-based system and a hidden Markov model. These studies show that recognizing sign language with sensors and dedicated equipment is expensive and needs to become more user-friendly and portable.

Bulletin of Electr Eng & Inf, ISSN: 2302-9285. An Adam based CNN and LSTM approach for sign language recognition in real … (Subrata Kumer Paul)

METHOD
Figures 1 and 2 show how the proposed method works. The dataset for the first proposed model was obtained from MediaPipe Holistic pose estimation. Key points are collected by following the process described in Figure 1. Using the MediaPipe Holistic library, points are gathered from hand, face, and pose landmarks. After several frames have been preprocessed, the arrays are passed to the LSTM and GRU models for the training and testing phases, where the trained model is evaluated on predictions made from the test data. Our model was trained for several epochs to obtain the recognition results reported for our second walkthrough. Two layers of different sizes are used in our LSTM models, and batch normalization is added before the top layer. Our LSTM follows the CNN-time-series divided parasitic capacitance (TSDPC)-LSTM framework proposed by Tang et al. [23]. However, instead of key frames from the video, the LSTM takes a sequence of key points as a 2D array. Our proposed GRU follows the architecture of Haque et al. [24], where the sequence is likewise passed as a 2D array of key points rather than as frame features. Dropout layers with a rate of 0.5 were added to prevent overfitting. The sigmoid and ReLU activation functions are used within the network, the softmax activation function is used in the output layer, and sparse categorical cross-entropy is chosen as the loss function. Adaptive moment estimation (Adam) [25] is used as the optimization method. The input shape is (30, 1660). For data acquisition, DroidCam and a 120×720-pixel camera device were used to acquire the images; the DroidCam images have a higher pixel count because they come from a phone. The method shown in Figure 2 is used to classify the 26 letters of the alphabet. The images are resized to 50×50 pixels, and several preprocessing steps are followed. Each of the collected images is also segmented. After preprocessing, the images are fed into a CNN model and split between train and test using the "processed train" dataset. Finally, recognition is performed on the independent test dataset. The architecture of the proposed CNN model is inspired by ResNet50 [26]; specifically, we used the improved ResNet50 architecture introduced by Wu et al. [27].
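The Adam update used as the optimization method can be illustrated with a minimal scalar sketch. The function name `adam_minimize` and the hyperparameter values (learning rate, `beta1`, `beta2`, `eps`) are assumptions for illustration: they are the commonly used defaults, not settings reported in this paper, and a deep learning framework would apply the same rule to every weight of the network.

```python
import math

def adam_minimize(grad, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient `grad`.

    Hyperparameter defaults are commonly used values, assumed here;
    the paper does not state the ones it used.
    """
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); the minimum is at x = 3.
x = adam_minimize(lambda x: 2 * (x - 3), theta=0.0, steps=500, lr=0.1)
print(x)
```

Because each step is scaled by the running gradient magnitude, the first update has a size close to the learning rate regardless of how large the raw gradient is.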

Dataset description
The proposed method depends on two deep learning techniques, LSTM and CNN. The LSTM method is tested on our dataset of pose estimation data from different webcams. Three folders named "hello," "iamhungry," and "thanks" are created, one per class, and the data are exported as NumPy arrays (x, y, and z coordinates of the hand pose) together with their labels. This dataset is referred to as DS-1. DS-1 therefore consists of the sequences of frames for each category.
The second method is based on a convolutional neural network. For it, we collected a dataset of 28,600 images using several webcams, covering 26 categories stored in 26 folders, one per letter from A to Z. Each folder contains two subfolders: the first holds the raw image files, and the second holds the preprocessed images. This dataset is referred to as DS-2.

Long short-term memory
The LSTM was designed to reduce error back-flow problems [28]. Hochreiter and Schmidhuber [28] created the LSTM algorithm, a modified recurrent neural network (RNN), to resolve these error backpropagation problems. The initial implementation of the LSTM consisted only of cells, input gates, and output gates.
The effectiveness of a plain RNN [29], by contrast, declines as the gap duration increases. The first stage is to decide which unneeded information to remove from the cell state. The "forget gate layer," a sigmoid layer, handles this decision. It considers $x_t$ and $C_{t-1}$, as depicted in Figure 3, and outputs a value between 0 and 1 for each number in the cell state $C_{t-1}$. An output of 1 means that the information is kept, whereas 0 means that it is deleted. Next, it must be decided what new data to store in the cell. This takes two steps. First, the input gate layer, which is also a sigmoid layer, determines which values to update. Second, a tanh layer generates a vector of new candidate values, $\tilde{C}_t$. Once these decisions have been made, the update is executed [30]. Equations (1)-(3) can be used to compute $f_t$, $i_t$, and $\tilde{C}_t$ [28].
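For reference, a commonly used form of the gate computations is sketched here; it is assumed, not confirmed by the text, that (1)-(3) take this form. The weight matrices $W_f$, $W_i$, $W_C$, the biases $b_f$, $b_i$, $b_C$, and the previous hidden state $h_{t-1}$ are notation introduced for this sketch. In this form the gates read $h_{t-1}$ and $x_t$, and $C_{t-1}$ enters through the cell-state update on the last line, where $\odot$ is element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\end{align}
```

A forget-gate entry near 1 keeps the corresponding entry of $C_{t-1}$, and an entry near 0 discards it, which matches the keep/delete behavior described above.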

RESULTS AND DISCUSSION
To calculate performance metrics such as average precision, recall, F1-score, and overall accuracy, the confusion matrix is produced first. The terms used in these calculations are defined as follows. A true positive (TP) is a result where the model correctly predicts the positive class. A true negative (TN) is a result where the model correctly predicts the negative class. A false positive (FP) occurs when the model incorrectly predicts the positive class. A false negative (FN) occurs when the model incorrectly predicts the negative class. Equations (8) to (11) are used to compute precision, recall, F1-score, and accuracy from the confusion matrix. Tables 1-5 display the performance of the models created for this study. Precision is calculated with (9), while recall is calculated with (11), which focuses on reducing FN rates [32]. The confusion matrix for LSTM and GRU is a 3×3 matrix [33] that shows how many predictions are correct and incorrect for the three classes "hello" (0), "iamhungry" (1), and "thanks" (2). It illustrates whether the classification algorithm classified the records correctly or incorrectly. The confusion matrices of the LSTM, GRU, and CNN models are used to assess each model in terms of accuracy, precision, and F1-score [34]. The classwise test results of the LSTM and GRU models, based on the SGD [29] and Adam [25] optimizers, are listed in Tables 1 and 2, respectively. These are the classwise precision, recall, F1-score, and accuracy. Classwise TP, FP, TN, and FN scores are also shown as 3×3 confusion matrices in Figures 5 and 6.

Figure 5. Confusion matrix for LSTM model
Figure 6. Confusion matrix for GRU model

Figure 5 shows the confusion matrix of the LSTM model, with 6, 4, and 7 true positive records for the "0: hello", "1: iamhungry", and "2: thank you" signs, respectively. Figure 6 shows the confusion matrix of the GRU model, with 3, 4, and 7 true positive records for the same three signs. According to this analysis, the GRU model is not reliable for classifying the "hello" class, whereas the LSTM model is robust across all classes. Table 3 reports the overall performance of both models as the average value of each performance measure. The GRU model performs substantially worse on average than the LSTM model, and LSTM performs best with the Adam optimizer, as Figure 7 shows.
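The classwise scores can be derived from a multi-class confusion matrix with the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as the harmonic mean of the two, and accuracy as the share of correct predictions. The sketch below assumes that rows are actual classes and columns are predicted classes; the function name `classwise_metrics` and the 3×3 matrix are illustrative only and are not the values of Figures 5 or 6.

```python
def classwise_metrics(cm):
    """Per-class precision, recall, F1 and overall accuracy.

    `cm` is a square confusion matrix; rows are assumed to be actual
    classes and columns predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    report = {}
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[k] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return report, accuracy

# Made-up matrix for three classes (not data from this study).
example_cm = [[5, 1, 0],
              [0, 4, 1],
              [1, 0, 8]]
report, accuracy = classwise_metrics(example_cm)
print(report[0], accuracy)
```

In this one-vs-rest view, the TN count for a class is everything outside its row and column, which is why it does not appear in precision, recall, or F1.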
The 26×26 confusion matrix for the CNN model is calculated from the predicted and actual outcomes. It contains the TP, TN, FP, and FN values used to calculate the classwise accuracy, precision, and recall for each of the 26 alphabet classes, displayed in Table 4. Table 5 gives the CNN model's overall performance for the SGD, RMSprop [29], and Adam [34] optimizers. Figure 8 shows that the CNN model with the Adam optimizer achieves higher precision but a lower F1 score. On average, the accuracy of the CNN model is 89.07% with the Adam optimizer. In contrast, the model optimized with SGD has the lowest accuracy and F1-score. Table 6 compares the performance of several existing methods. From our analysis, it can also be concluded that although gloves and other devices, such as cyber gloves [13] combined with the hidden Markov model, give good results for recognizing signs, they are not affordable for many people. A vision-based approach reduces the cost because it needs only the trained model and a capture device such as a webcam or DroidCam (a third-party app that turns a mobile camera into a webcam).

Ref. | Used method(s) | Accuracy (%)
[13] | Cyber gloves and HMMs | 95
[16] | Leap motion devices | 90.82
[20] | Flex sensors | 90.34 (precision score)
[21] | Data gloves and 3D position trackers | 91.9
[22] | One-handed glove-based system and HMMs | 94
Proposed work | LSTM | 94.3
Proposed work | GRU | 76
Proposed work | CNN | 89.07

REAL-TIME ANALYSIS
This section examines how our method performs in an actual scene in real time. Real-time analysis is necessary because, without real-time monitoring, it cannot be determined how the model will perform once deployed or how the feature can be integrated to benefit users. Figure 9 gives a glimpse of sign language recognized from the segmented images. When the hand is placed inside the green box, the system captures a 50×50 image of that region and identifies the sign.

Figure 9. Sign language recognition using CNN model

The system then converts the hand sign into the corresponding letter of the American sign alphabet, which contains the letters A to Z. Each time a different sign is shown in the region of the box, the system captures it and interprets it continuously. This recognition is performed using our proposed improved ResNet50-based CNN model. Figures 10-12 show real-time recognition of three significant sign language actions using our proposed LSTM-based method. Figure 10 demonstrates how our system reacts to the sign language action "hello". Figure 11 demonstrates how it reacts in real time to the action expressing hunger, and Figure 12 displays how it reacts to the action "thanks". Our proposed system successfully recognizes the sign language action in each of the three scenarios in real time.
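One way continuous interpretation over a live stream can be organized is with a sliding window of the most recent frames. This is a sketch under assumptions, not the authors' implementation: the names `stream_predictions` and `predict` are hypothetical, the window length of 30 is taken from the (30, 1660) input shape given earlier, and `predict` stands in for a trained sequence model.

```python
from collections import deque

SEQ_LEN = 30  # frames per sequence, taken from the (30, 1660) input shape

def stream_predictions(frames, predict, seq_len=SEQ_LEN):
    """Yield one prediction per incoming frame once the window is full.

    `frames` is any iterable of per-frame key-point vectors and
    `predict` is a hypothetical callable that maps a list of `seq_len`
    vectors to a label.
    """
    window = deque(maxlen=seq_len)  # the oldest frame drops out automatically
    for keypoints in frames:
        window.append(keypoints)
        if len(window) == seq_len:
            yield predict(list(window))

# Toy stand-ins: 35 one-dimensional "frames" and a dummy model that
# returns the first value in the window.
toy_frames = [[float(i)] for i in range(35)]
labels = list(stream_predictions(toy_frames, lambda seq: seq[0][0]))
print(len(labels))
```

With a window that advances by one frame, the first prediction appears only after 30 frames have been seen, and every later frame produces a new one.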

CONCLUSION
In this study, a sign language recognition system based on CNN and LSTM methods has been introduced. This system may help people who are deaf, mute, or do not know sign language communicate much better. Two datasets were created for this purpose: one has 26 signs, and the other has three important symbols with the sequences of frames needed for normal conversation. The improved ResNet-based CNN, LSTM, and GRU are examined in this paper. The Adam optimizer produces satisfactory results across the board. Using the first dataset to fit and evaluate the CNN model yields an accuracy of 89.07%. The second dataset is given to LSTM and GRU, and a comparative model evaluation is conducted. LSTM outperforms GRU on every measure. LSTM has an accuracy of 94.3%, while GRU only achieves a success rate of 79.3%. The proposed system can accurately recognize and translate sign language into speech or text. This makes sign language recognition easier to use and more widely available, giving people in need a vital way to communicate. As technology keeps improving and more languages are added, the chances that these systems will improve the lives of deaf and mute people will only increase. In the future, it will be possible to add more languages and to use sign-to-text and text-to-speech tools.

Figure 1. Block diagram of the proposed model for LSTM and GRU

Figure 3. The basic block diagram of LSTM memory cell

$$r_t = \sigma\left(W_{xr} \cdot [h_{t-1}, x_t]\right) \tag{4}$$
$$z_t = \sigma\left(W_{xz} \cdot [h_{t-1}, x_t]\right) \tag{5}$$
$$h_t' = \tanh\left(W_{xh} \cdot [r_t \odot h_{t-1}, x_t]\right) \tag{6}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot h_t' \tag{7}$$

where $W_{xr}$, $W_{xz}$, and $W_{xh}$ are weight matrices, $\sigma$ is the sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, $\odot$ is element-wise multiplication, $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, $r_t$ is the reset gate, $z_t$ is the update gate, $h_t'$ is the candidate hidden state, and $h_t$ is the updated hidden state. These equations show how the GRU gates work together to control the flow of information from the current input and the previous hidden state to the current hidden state. The reset gate $r_t$ determines which parts of the previous hidden state should be forgotten, and the update gate $z_t$ determines how much of the current input should be used to update the hidden state. The candidate hidden state $h_t'$ is calculated from the reset-gated previous hidden state and the current input, and is used to update the hidden state according to the update gate $z_t$. The structure of a memory cell is illustrated as a circuit diagram in Figure 4 [31].
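The GRU update in (4)-(7) can be traced step by step with a small pure-Python sketch. Biases are omitted, as in the equations; the helper names `sigmoid`, `matvec`, and `gru_step` and the toy dimensions are assumptions for illustration and are not taken from the paper's implementation.

```python
import math

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU update following (4)-(7), without bias terms."""
    concat = h_prev + x_t                                 # [h_{t-1}, x_t]
    r_t = sigmoid(matvec(W_r, concat))                    # reset gate (4)
    z_t = sigmoid(matvec(W_z, concat))                    # update gate (5)
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t    # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]   # candidate state (6)
    return [(1 - z) * h + z * c                           # new hidden state (7)
            for z, h, c in zip(z_t, h_prev, h_cand)]

# Toy sizes: hidden state of length 2, input of length 3, all-zero weights.
hidden, inputs = 2, 3
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
h_t = gru_step([1.0, 2.0, 3.0], [0.4, -0.8], zeros, zeros, zeros)
print(h_t)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new hidden state is half of the previous one; this makes the blending role of the update gate in (7) easy to see.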

Figure 8. Performance analysis of CNN

Figure 10. Detecting sign language of "hello"

The GRU is a recurrent neural network for sequential data such as speech or text. It is similar to the more popular LSTM network but has fewer parameters and is easier to train. The GRU has a hidden state vector $h$ that is updated at each time step $t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$. It also has two gating mechanisms: the reset gate $r_t$ and the update gate $z_t$. The reset gate determines how much of the previous hidden state should be forgotten, while the update gate determines how much of the current input should be used to update the hidden state. The GRU's reset gate $r_t$, update gate $z_t$, candidate hidden state $h_t'$, and hidden state $h_t$ can be computed by (4)-(7).

Table 3. Comparative analysis of the overall performance of LSTM and GRU model

Table 4. Classwise accuracy, precision and recall for CNN model

Table 5. Performance report of CNN model

Table 6. Comparative analysis of the related and existing approaches